<a href="https://colab.research.google.com/github/sugatoray/CodeSnippets/blob/master/src/Notebooks/DataSourceManager.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DataSourceManager

The `DataSourceManager` class provides a simple API for accessing data sources (`.csv` files) hosted at various locations: you supply the url of a source, and the data is returned as a pandas dataframe.

For example, some of the following sources need extra arguments or clean-up when they are read from their urls, and it is easy to forget exactly which transformation each one requires. Add another 10-20 data sources, and keeping track of what each data source (`.csv` file) needs becomes increasingly difficult.
```python
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
movies = pd.read_csv('http://bit.ly/imdbratings')
orders = pd.read_csv('http://bit.ly/chiporders', sep='\t')
orders['item_price'] = orders.item_price.str.replace('$', '', regex=False).astype('float')
stocks = pd.read_csv('http://bit.ly/smallstocks', parse_dates=['Date'])
titanic = pd.read_csv('http://bit.ly/kaggletrain')
ufo = pd.read_csv('http://bit.ly/uforeports', parse_dates=['Time'])
```

The objective of the `DataSourceManager` class is to relieve the user of having to recall the url(s) and the corresponding transformation(s) needed to access a given data source. Once the package is installed, the rest is designed to be intuitive and self-explanatory. For completeness, documentation will be available as well.
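
As a rough illustration of the idea (not the implementation used by the class below), the per-source options can be kept in a lookup table so that nobody has to remember them. The names `READ_OPTIONS`, `strip_dollar`, and `load` are hypothetical, and the sample data is an in-memory stand-in for a real url.

```python
import io
import pandas as pd

def strip_dollar(df):
    """Convert a '$2.39'-style item_price column to float."""
    df = df.copy()
    df['item_price'] = (df['item_price']
                        .str.replace('$', '', regex=False)
                        .astype(float))
    return df

# Hypothetical registry: name -> (read_csv keyword arguments, optional clean-up step)
READ_OPTIONS = {
    'orders': ({'sep': '\t'}, strip_dollar),
    'stocks': ({'parse_dates': ['Date']}, None),
}

def load(name, source):
    """Read `source` with the options registered for `name` (defaults if none)."""
    kwargs, post = READ_OPTIONS.get(name, ({}, None))
    df = pd.read_csv(source, **kwargs)
    return post(df) if post is not None else df

# In-memory tab-separated sample standing in for a remote .csv file
sample = io.StringIO("order_id\titem_price\n1\t$2.39\n2\t$3.39\n")
orders = load('orders', sample)
print(orders['item_price'].tolist())
```

A table like this keeps each source's quirks in one place; the class below achieves the same effect with explicit branches in `_get_data`.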

In [20]:
#@title `class DataSourceManager` { vertical-output: true }
import pandas as pd
from abc import ABC, abstractmethod
from functools import partial

#@markdown ### Print Imported Library Info 
def print_lib_info(libs=None, sep='/', sep_length=1, include_package=False):
    if libs is not None:
        lib_info = []
        separator = ''.join([sep]*sep_length)
        for lib in libs:
            #lib_name, lib_version = lib.__name__, lib.__version__
            lib_package = getattr(lib,'__package__', None)
            lib_name = getattr(lib,'__name__', None)
            lib_version = getattr(lib,'__version__', None)
            s = ''
            if include_package:
                s = '{}{}'.format(lib_package, separator)
            s += '{}: {}'.format(lib_name, lib_version)
            lib_info.append(s)
        print('\n'.join(lib_info))

#@markdown ---

libs = [pd]
print_lib_info(libs = libs)

#@markdown `class DataSourceManager`

class DataSources(ABC):

    def __init__(self):
        self.urls = None # type: Dict

    def fetch_all(self): pass

    def __repr__(self):
        return '{} object.'.format(self.__class__.__name__)

class DataSource(ABC):

    def __init__(self):
        self.name: str = None
        self.url: str = None
        self.data = None
    
    #@abstractmethod    
    def fetch(self): pass

    def __repr__(self):
        #repr_data = 'None' if self.data is None else data.shape
        #repr_data =  ', data.shape: {}'.format(repr_data)        
        msg = '{}( name: {}, url: {} )' #+ f'{repr_data} )'
        return msg.format(self.__class__.__name__, self.name, self.url)

class DataSourceURLs(object):
    dataurls = {
        'drinks': 'http://bit.ly/drinksbycountry', 
        'movies': 'http://bit.ly/imdbratings', 
        'orders': 'http://bit.ly/chiporders', 
        'stocks': 'http://bit.ly/smallstocks', 
        'titanic': 'http://bit.ly/kaggletrain',
        'ufo': 'http://bit.ly/uforeports', 
    }

class DataSourceManager(DataSourceURLs):
    _DEFAULT_DATAFRAME_NAME = 'titanic'

    def __init__(self, fetch_all=False):
        super().__init__()
        self.ds = DataSources()
        self.datasources = self.ds
        self.ds.urls = self.dataurls.copy()
        self.ds.fetch_all = partial(self.fetch_all) # functools.partial
        for attr in self.ds.urls:          
            self.add_datasource(name = attr)              
        if fetch_all:
            self.fetch_all(refresh=False)
            
    def __repr__(self):
        s = '{}'.format(self.__class__.__name__)
        msg = ''
        for name, url in self.ds.urls.items():
            msg += f'\n {name}: {url}'
        msg = s + '(datasources: {})'.format(len(self.ds.urls)) + msg
        return msg

    def add_datasource(self, name=None, url=None):
        """Add a data source with a name and url.

        This adds another entry to self.ds.urls and then makes downstream
        updates. There is no change made to self.dataurls.
        """
        if (name is not None): 
            if ((name not in self.ds.urls) and (url is not None)):
                # add datasource to self.ds.urls
                self.ds.urls.update({name: url})
            setattr(self.ds, name, DataSource())
            obj = getattr(self.ds, name)            
            obj.name = name
            obj.url = self.ds.urls.get(name)
            obj.fetch = partial(self.fetch, name = name) # functools.partial
            ## Alternate way of updating the attributes/methods
            # setattr(obj, 'name', name)
            # setattr(obj, 'url', self.dataurls.get(name))
            # setattr(obj, 'fetch', partial(self.fetch, name = name))

            
    def fetch_all(self, refresh=False):
        """Fetch all datasets.
        """
        for attr in self.ds.urls:
            #setattr(self.df, attr, self.fetch(name = attr))
            self.fetch(name = attr, refresh=refresh)

    def fetch(self, name=None, refresh=False): 
        """Fetch a dataset by name.        
        """
        if name is None: 
            name = self._DEFAULT_DATAFRAME_NAME 
        obj = getattr(self.ds, name)
        if refresh or (obj.data is None):
            setattr(obj,  
                    'data', 
                    self._get_data(name = name))
        #return self._get_data(name=name)

    @staticmethod
    def test_datasource(name=None, url=None):
        """Test if a datasource is accessible using 
        pandas.read_csv() method.
        """
        if url is not None:
            try:
                data = pd.read_csv(url)
                return data
            except Exception as e:
                print('Error: invalid url')
                print(e)
                
        

    def _get_data(self, name='titanic', url=None):
        """Fetches data for a datasource with known url or a new datasource from
        a user-provided url.

        When url=None, a dictionary search returns the url from self.ds.urls.
        This url is then used to return a pandas.DataFrame.
        """
        if url is None:
            url = self.ds.urls.get(name)
        if name in ['drinks', 'movies', 'titanic']:
            data = pd.read_csv(url)
        elif name == 'orders':
            data = pd.read_csv(url, sep='\t')
            data['item_price'] = (data
                                    .item_price
                                    .str.replace('$', '', regex=False)
                                    .astype('float'))
        elif name == 'stocks':
            data = pd.read_csv(url, parse_dates=['Date'])
        elif name == 'ufo':
            data = pd.read_csv(url, parse_dates=['Time'])
        else:
            data = pd.read_csv(url)
        return data



pandas: 1.0.3


## Instantiating `DataSourceManager`

Instantiate `DataSourceManager` and print the instance, `dsm`.

In [15]:
dsm = DataSourceManager()
print(dsm)

DataSourceManager(datasources: 6)
 drinks: http://bit.ly/drinksbycountry
 movies: http://bit.ly/imdbratings
 orders: http://bit.ly/chiporders
 stocks: http://bit.ly/smallstocks
 titanic: http://bit.ly/kaggletrain
 ufo: http://bit.ly/uforeports


## Basic Structure of `DataSourceManager`

The structure of `DataSourceManager` is as follows:  

```python
DataSourceManager (dsm)
    |---DataSources (ds)
        |---List[DataSource]
            - drinks
            - movies
            - orders
            - stocks
            - titanic
                |---name # type: str
                |---url # type: str
                |---data # type: pandas.DataFrame
                |---fetch() # method
            - ufo
        |---fetch_all() #method
        |---urls # user-can-update | type: Dict
    |---fetch() # method
    |---fetch_all() # method
    |---add_datasource() # method
    |---test_datasource() # staticmethod
    |---dataurls # user-CANNOT-update | type: Dict
```
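
The two entry points in this tree (`dsm.fetch(name=...)` and `dsm.ds.<name>.fetch()`) come from binding the manager's method to each source with `functools.partial`. The following is a minimal sketch of that pattern only; `MiniManager` and its `loader` argument are hypothetical stand-ins, and no data is downloaded.

```python
from functools import partial
from types import SimpleNamespace

class MiniManager:
    """Hypothetical, stripped-down stand-in for DataSourceManager."""

    def __init__(self, names, loader):
        self._loader = loader          # assumed callable: name -> data
        self.ds = SimpleNamespace()    # container for the per-source objects
        for name in names:
            source = SimpleNamespace(name=name, data=None)
            # Pre-fill the `name` argument so source.fetch() needs no arguments.
            source.fetch = partial(self.fetch, name=name)
            setattr(self.ds, name, source)

    def fetch(self, name):
        # Store the loaded data on the matching source object.
        getattr(self.ds, name).data = self._loader(name)

manager = MiniManager(['titanic', 'ufo'],
                      loader=lambda name: f'<data for {name}>')
manager.ds.titanic.fetch()     # per-source entry point
manager.fetch(name='ufo')      # manager-level entry point
print(manager.ds.titanic.data)
print(manager.ds.ufo.data)
```

Both calls go through the same `fetch` method, so any logic added there applies to either entry point.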


## Loading Data

To load data (`dsm.ds.titanic.data`) for any data-source, for instance, _**titanic**_, you have the following options:  

- `dsm.ds.titanic.fetch()`
- `dsm.fetch(name='titanic')`

To load all the data-sources (`dsm.ds.urls`) to their respective locations (`dsm.ds.<datasource-name>.data`), either of the following will work.  

- `dsm.ds.fetch_all()`
- `dsm.fetch_all()`

If you want to reload a data-source (say, _titanic_), you can use either of the following two methods.  

- `dsm.ds.titanic.fetch(refresh = True)`
- `dsm.fetch(name = 'titanic', refresh = True)`
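
The effect of `refresh` can be sketched with a small stand-alone class: data is loaded on the first call, reused afterwards, and reloaded only when `refresh=True`. `CachedSource`, its `loader` argument, and the `downloads` counter are hypothetical names used for illustration; the real class reads from a url instead.

```python
class CachedSource:
    """Hypothetical stand-in for a single data source with lazy loading."""

    def __init__(self, loader):
        self._loader = loader  # any zero-argument callable that returns the data
        self.data = None
        self.downloads = 0     # counts how often the loader actually ran

    def fetch(self, refresh=False):
        # Load only if nothing is cached yet, or if a reload is forced.
        if refresh or self.data is None:
            self.data = self._loader()
            self.downloads += 1

source = CachedSource(loader=lambda: [1, 2, 3])
source.fetch()               # first call loads the data
source.fetch()               # second call reuses the cached copy
source.fetch(refresh=True)   # forces a reload
print(source.downloads)
```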

In [16]:
# All DataSources, Attributes and Methods under dsm.ds
[e for e in dir(dsm.ds) if not (e.startswith('__') or e.startswith('_'))]

['drinks', 'fetch_all', 'movies', 'orders', 'stocks', 'titanic', 'ufo', 'urls']

## Example

Here is an example of how to load data from data-source `titanic`.

In [22]:
dsm = DataSourceManager()
print(dsm)
dsm.ds.titanic.fetch()  # or, dsm.fetch(name='titanic', refresh=False) 
                        # NOTE: refresh=True >>> enables fresh download
dsm.ds.titanic.data.head()

DataSourceManager(datasources: 6)
 drinks: http://bit.ly/drinksbycountry
 movies: http://bit.ly/imdbratings
 orders: http://bit.ly/chiporders
 stocks: http://bit.ly/smallstocks
 titanic: http://bit.ly/kaggletrain
 ufo: http://bit.ly/uforeports


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


## **Bonus Content**: Visualize Pandas Profile Report

You can use the library [pandas_profiling][#pandas-prfiling-github] to quickly visualize a profile-report for a given `pandas.DataFrame`. Here is a [YouTube Video][#pandas-prfiling-youtube] on _**how to use pandas_profiling**_.

[#pandas-prfiling-github]: https://github.com/pandas-profiling/pandas-profiling
[#pandas-prfiling-youtube]: https://www.youtube.com/watch?v=RlIiVeig3hc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=35&t=1574s


```python
# if not installed already
#   PIP:    pip install pandas_profiling
#   CONDA:  conda install -c anaconda pandas-profiling
import pandas_profiling as pdp
df = dsm.ds.titanic.data  # any fetched pandas.DataFrame, e.g. the titanic data loaded above
pdp.ProfileReport(df)
```