### NeurIPS Meetup Workshop

#### Title: ["Datasist": A Python Library for Easy Data Analysis, Visualization and Modelling.](https://arxiv.org/abs/1911.03655)

<img src="dlogo.jpeg" width="300px" height="250px" alt='datasist logo'/>

### Presented by: [Rising Odegua](https://twitter.com/risin_developer)

#### Email: risingodegua@gmail.com
#### Phone: +234-8140299072

__________________________________
#### What is Datasist?
Datasist makes data analysis, visualization, cleaning, preparation, and even modeling super easy for you during prototyping.

Because let's face it, I wouldn't want to do this... (please look at the code block below)

```python
import pandas as pd

data = pd.read_csv('some_csv_file.csv')

# Percentage of missing values in each column
missing_percent = (data.isna().sum() / data.shape[0]) * 100
cols_2_drop = missing_percent[missing_percent.values >= 80].index
# Drop the columns that are at least 80 percent missing
df = data.drop(cols_2_drop, axis=1)

```

...just because I want to drop columns with missing percentage greater than or equal to 80, when I can simply do this (Please look at the beauty below)

```python
import pandas as pd
import datasist as ds

data = pd.read_csv('some_csv_file.csv')
df = ds.feature_engineering.drop_missing(data=data, percent=80)

```

*smiles* I know, right? It's lazy, but efficient.

The goal of datasist is to abstract the repetitive and mundane code, functions and techniques we use into simple, short functions and methods that can be called. Datasist was born out of sheer laziness, because let's face it, unless you're a 10x data scientist, you hate typing long, boring and mundane chunks of code to do the same thing over and over again.

The design of datasist is currently centered around five modules, namely:

1. structdata
2. feature_engineering
3. timeseries
4. visualizations
5. model

This is subject to change in future versions as we are currently working on support for many other areas in the field. 

The aim of this post is to introduce you to some of the most important features in each of these modules and how you can start using them in your projects.
To keep this post short and concise, I have split it into two parts.
__________________________________

What you will learn in this workshop:

* Working with the datasist structdata module.
* Feature engineering with datasist.
* Easy visualization with datasist.
* Testing and comparing machine learning models with datasist.

#### Installing datasist

```python
    pip install datasist
```
Remember to use the exclamation symbol if you're running the command from a Jupyter notebook.

```python
    !pip install datasist
```

Next, you need a dataset to play with. You can use any dataset, but for consistency, you can download the dataset I used for this workshop [here](https://zindi.africa/competitions/data-science-nigeria-2019-challenge-1-insurance-prediction/).

In [None]:
import pandas as pd
import datasist as ds  #import datasist library
import numpy as np

train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')

### Working with the structdata module

The structdata module contains numerous functions for working with structured data, mostly in the Pandas DataFrame format. That is, you can use the functions in this module to easily manipulate and analyze DataFrames. We highlight some of the functions available.

1. describe function: We all know the describe function in Pandas. We decided to extend it to support a full description of a dataset at a glance.

In [None]:
ds.structdata.describe(train_df)

From the result, you can have a full description and properly understand some of the important features of your dataset at a glance, and in one line. 
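
For intuition, much of such an overview can be pieced together by hand in plain pandas. The sketch below is not datasist's implementation and does not reproduce its exact output; `quick_overview` and the toy DataFrame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def quick_overview(df):
    """Return one row per column: dtype, missing count and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a class
    })


# Hypothetical toy data, loosely modelled on the insurance dataset
toy = pd.DataFrame({
    "Garden": ["V", "O", None, "V"],
    "Insured_Period": [1.0, 0.5, 1.0, np.nan],
})

overview = quick_overview(toy)
print("Shape:", toy.shape)
print(overview)
```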

2. check_train_test_set: This function is used to check the sampling strategy of two datasets. This is important because if the two datasets are not from the same distribution, statistics computed on the first set (say, train) cannot be reliably applied to the second (say, test).

To use this, you pass in the datasets (train_df and test_df), a common index ('Customer Id'), and finally any feature or column available in both datasets.

In [None]:
ds.structdata.check_train_test_set(train_df, test_df, index='Customer Id', col='Building Dimension')
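
If you want to run a similar sanity check by hand, one option is to count the IDs shared by both sets and compare the summary statistics of a common column. This is only a rough sketch of the idea in plain pandas; `compare_sets` is a hypothetical helper, and what datasist's own function reports may differ.

```python
import pandas as pd


def compare_sets(train, test, index, col):
    """Count shared index values and compare summary statistics of `col`."""
    shared_ids = set(train[index]) & set(test[index])
    stats = pd.DataFrame({
        "train": train[col].describe(),
        "test": test[col].describe(),
    })
    return len(shared_ids), stats


# Hypothetical toy data
toy_train = pd.DataFrame({
    "Customer Id": ["a", "b", "c"],
    "Building Dimension": [100.0, 200.0, 300.0],
})
toy_test = pd.DataFrame({
    "Customer Id": ["c", "d"],
    "Building Dimension": [150.0, 250.0],
})

n_shared, stats = compare_sets(toy_train, toy_test,
                               index="Customer Id", col="Building Dimension")
print("Shared ids:", n_shared)
print(stats)
```

Similar means and spreads in the two columns are a hint, not a proof, that both sets were sampled the same way.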

3. display_missing: You can check for the missing values in your dataset and display the result in a well-formatted DataFrame.

In [None]:
ds.structdata.display_missing(train_df)
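
A missing-value table like this can also be computed directly in pandas. The sketch below assumes a simple count-and-percentage layout; `missing_table` and its column names are hypothetical and not taken from datasist.

```python
import numpy as np
import pandas as pd


def missing_table(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(1)
    return pd.DataFrame({"missing_count": counts, "missing_percent": percent})


# Hypothetical toy data
toy = pd.DataFrame({
    "Garden": ["V", None, "O", "V"],
    "Building Dimension": [100.0, np.nan, np.nan, 400.0],
    "Claim": [0, 1, 0, 0],
})

table = missing_table(toy)
print(table)
```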

4. get_cat_feats and get_num_feats: As their names suggest, these functions retrieve the categorical and numerical features respectively, each as a list.

In [None]:
cat_feats = ds.structdata.get_cat_feats(train_df)
cat_feats

In [None]:
num_feats = ds.structdata.get_num_feats(train_df)
num_feats
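
Under the hood, this kind of split usually comes down to inspecting column dtypes. Here is a minimal sketch in plain pandas, assuming that every non-numeric column counts as categorical (an assumption of this example, not a statement about how datasist decides); `split_feature_types` is a hypothetical name.

```python
import pandas as pd


def split_feature_types(df):
    """Split column names into (categorical, numerical) lists by dtype."""
    num_feats = df.select_dtypes(include="number").columns.tolist()
    # Assumption: anything that is not numeric is treated as categorical
    cat_feats = [col for col in df.columns if col not in num_feats]
    return cat_feats, num_feats


# Hypothetical toy data
toy = pd.DataFrame({
    "Garden": ["V", "O"],
    "Geo_Code": ["1053", "1143"],
    "Insured_Period": [1.0, 0.5],
    "Claim": [0, 1],
})

toy_cat_feats, toy_num_feats = split_feature_types(toy)
print(toy_cat_feats)
print(toy_num_feats)
```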

5. get_unique_counts: Ever wanted to get the unique classes in your categorical features before you decide what encoding scheme to use? Well, you can use the get_unique_counts function to easily do that.

In [None]:
ds.structdata.get_unique_counts(train_df)
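
To see why these counts matter for choosing an encoding, it helps to look at them on a small example. The sketch below uses plain pandas (`nunique` and `value_counts`) on hypothetical toy data; it is not datasist's implementation.

```python
import pandas as pd

# Hypothetical toy data with one low-cardinality and one higher-cardinality feature
toy = pd.DataFrame({
    "Garden": ["V", "O", "V", "V"],
    "Geo_Code": ["1053", "1143", "1160", "1053"],
})

# Number of unique classes per feature
unique_counts = toy.nunique()
print(unique_counts)

# How often each class of a single feature occurs
garden_classes = toy["Garden"].value_counts()
print(garden_classes)
```

A feature with few classes, like Garden here, is a natural candidate for one-hot encoding, while a feature with many classes may be better served by label encoding.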

6. join_train_and_test: Most of the time when prototyping, you may want to concatenate both the train and test sets, and then apply some transformations to them. You can use the join_train_and_test function for that. It returns a concatenated dataset and the sizes of the train and test data for splitting in the future.

In [None]:
all_data, ntrain, ntest = ds.structdata.join_train_and_test(train_df, test_df)
print("New size of combined data {}".format(all_data.shape))
print("Old size of train data: {}".format(ntrain))
print("Old size of test data: {}".format(ntest))

#later splitting after transformations
train = all_data[:ntrain]
test = all_data[ntrain:]
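
The same join-then-split pattern can be written with `pd.concat`. This is a sketch under the assumption that the rows are simply stacked with the train rows first; `join_sets` is a hypothetical helper and not datasist's code.

```python
import pandas as pd


def join_sets(train, test):
    """Stack train on top of test and remember the original sizes."""
    ntrain, ntest = len(train), len(test)
    combined = pd.concat([train, test], ignore_index=True, sort=False)
    return combined, ntrain, ntest


# Hypothetical toy data: the test set has no target column
toy_train = pd.DataFrame({"Residential": [0, 1, 1], "Claim": [0, 1, 0]})
toy_test = pd.DataFrame({"Residential": [1, 0]})

combined, n_train, n_test = join_sets(toy_train, toy_test)

# Split back after any shared transformations
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
print(combined.shape, n_train, n_test)
```

Note that columns present in only one set, such as the target, are filled with NaN for the rows of the other set.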

Those are some of the popular functions in the structdata module of datasist. To see other functions and to learn more about the parameters you can tweak, check the [API documentation here](https://risenw.github.io/datasist/index.html).

### Feature engineering with datasist.

Feature engineering is the creation, manipu......

Some of the functions available in the feature_engineering module of datasist are:

NOTE: Functions in the feature_engineering module always return a new, transformed DataFrame. Nothing happens in place, so you are expected to assign the result to a variable.

1. drop_missing: This function drops columns/features with a specified percentage of missing values. Assuming I have a set of features with, say, 90 percent missing values, I would want to drop these particular columns. I can do this with the drop_missing function.


In [None]:
#first let's see the percentage of missing values
ds.structdata.display_missing(train_df)

Just for demonstration, we'll drop the column with 7.1 percent missing values.
Note: You do not want to be dropping a column/feature with so few missing values. What you should do is fill it, but we drop it here just for demonstration purposes.

In [None]:
new_train_df = ds.feature_engineering.drop_missing(train_df, percent=7.0)
ds.structdata.display_missing(new_train_df)

2. drop_redundant: This function is used to remove features with low or no variance, that is, features that contain the same class all through. We show a simple example using an artificial dataset.

In [None]:
df = pd.DataFrame({'a': [1,1,1,1,1,1,1],
                  'b': [2,3,4,5,6,7,8]})

df

Now looking at the artificial dataset above, we see that column __a__ is redundant, that is, it has the same class all through, while column __b__ varies. We can drop such columns automatically by just passing the dataset to the drop_redundant function.

In [None]:
df = ds.feature_engineering.drop_redundant(df)
df
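
The idea behind removing redundant features can be sketched in a few lines of pandas: find the columns with a single unique value and drop them. `drop_constant_columns` is a hypothetical helper written for this illustration, not datasist's implementation.

```python
import pandas as pd


def drop_constant_columns(df):
    """Drop columns that hold the same value in every row."""
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)


toy = pd.DataFrame({"a": [1, 1, 1, 1, 1, 1, 1],
                    "b": [2, 3, 4, 5, 6, 7, 8]})

reduced = drop_constant_columns(toy)
print(reduced.columns.tolist())
```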

3. convert_dtype: This function takes a DataFrame and automatically type casts features that are not represented in their right types. Let's see an example.

In [None]:
data = {'Name':['Tom', 'nick', 'jack'],
        'Age':['20', '21', '19'], 
        'Date of Birth': ['1999-11-17','20 Sept 1998','Wed Sep 19 14:55:02 2000']}

df = pd.DataFrame(data)
df

In [None]:
df.dtypes

The features Age and Date of Birth are supposed to be integer and Datetime respectively. By passing this DataFrame to the convert_dtype function, this can be automatically fixed.

In [None]:
df = ds.feature_engineering.convert_dtype(df)
df.dtypes
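
To get a feel for what automatic type casting involves, here is a simplified sketch in plain pandas. It tries a numeric conversion first and then a date conversion, and leaves the column alone if both fail. It assumes dates are written as `YYYY-MM-DD`, so unlike the example above it does not handle mixed date formats; `infer_types` is a hypothetical helper, not datasist's implementation.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def infer_types(df):
    """Cast string columns to numeric or datetime where every value parses."""
    out = df.copy()
    for col in out.columns:
        if is_numeric_dtype(out[col]) or is_datetime64_any_dtype(out[col]):
            continue  # already in a usable type
        try:
            out[col] = pd.to_numeric(out[col])
            continue
        except (ValueError, TypeError):
            pass  # not numeric, try dates next
        try:
            # Assumption: dates use the ISO format YYYY-MM-DD
            out[col] = pd.to_datetime(out[col], format="%Y-%m-%d")
        except (ValueError, TypeError):
            pass  # leave the column unchanged
    return out


toy = pd.DataFrame({
    "Name": ["Tom", "nick", "jack"],
    "Age": ["20", "21", "19"],
    "Date of Birth": ["1999-11-17", "1998-09-20", "2000-09-19"],
})

converted = infer_types(toy)
print(converted.dtypes)
```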

4. fill_missing_cats: As the name implies, this function takes a DataFrame and automatically detects categorical columns with missing values. It fills them using the mode.

In [None]:
ds.structdata.display_missing(train_df)

From the dataset, we have two categorical features with missing values: Garden and Geo_Code.

In [None]:
df = ds.feature_engineering.fill_missing_cats(train_df)
ds.structdata.display_missing(df)

5. fill_missing_num: This is similar to fill_missing_cats, except that it works on numerical features and you can specify a filling strategy (mean, mode or median).

From the dataset, we have two numerical features with missing values: Building Dimension and Date_of_Occupancy.

In [None]:
df = ds.feature_engineering.fill_missing_num(train_df)
ds.structdata.display_missing(df)
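
The two filling strategies described above can be approximated in plain pandas as follows: the mode for non-numeric columns, and a chosen statistic for numeric ones. This is a sketch only; `fill_missing` and its `method` argument are hypothetical and merely mirror the behaviour described in the text.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_missing(df, method="mean"):
    """Fill numeric columns with mean/median/mode and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if is_numeric_dtype(out[col]) and method == "mean":
            fill_value = out[col].mean()
        elif is_numeric_dtype(out[col]) and method == "median":
            fill_value = out[col].median()
        else:
            # Mode for categorical columns, or when method="mode"
            fill_value = out[col].mode().iloc[0]
        out[col] = out[col].fillna(fill_value)
    return out


# Hypothetical toy data
toy = pd.DataFrame({
    "Garden": ["V", "V", None, "O"],
    "Building Dimension": [100.0, np.nan, 300.0, 200.0],
})

filled = fill_missing(toy, method="mean")
print(filled)
```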

6. log_transform: This function can help you log transform a set of features. It can also display before and after plots with the level of skewness to help you decide if log transforming is what you want.

After visualizing some of the dataset, which we will study next, we found out that Building Dimension and Date_of_Occupancy are skewed. Let's use the log_transform function on them.

Note: Make sure your columns do not contain missing values, else it will throw an error.


In [None]:
df = ds.feature_engineering.fill_missing_num(df)
df = ds.feature_engineering.log_transform(df, columns=['Building Dimension'])
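
To see the effect of a log transform on skewness without any plotting, you can compare the skew before and after. The sketch below uses `np.log1p` (the log of 1 + x, which tolerates zeros); whether datasist uses the same variant is an assumption, and `log_transform_columns` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def log_transform_columns(df, columns):
    """Return a copy of df with log(1 + x) applied to the given columns."""
    out = df.copy()
    for col in columns:
        out[col] = np.log1p(out[col])
    return out


# Hypothetical right-skewed toy data
toy = pd.DataFrame({
    "Building Dimension": [50.0, 100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0],
})

transformed = log_transform_columns(toy, ["Building Dimension"])
skew_before = toy["Building Dimension"].skew()
skew_after = transformed["Building Dimension"].skew()
print("Skew before:", skew_before)
print("Skew after:", skew_after)
```

A skew value closer to zero after the transform suggests the feature is now more symmetric.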

To work with features like latitude and longitude, datasist has dedicated functions like bearing, manhattan_distance, get_location_center etc, available in the feature_engineering module. I'll leave that for you to explore. To see these and other functions, you can check the API documentation here.

### WORKING WITH TIME BASED FEATURES

Finally in this part, I'll talk about the timeseries module in datasist. The timeseries module contains functions for working with date time features. It can help you extract new features from Date features and help you visualize Date Features.

1. extract_dates: This function can extract specified features like day of the week, day of the year, hour, minute and second of the day from a specified date feature.
To demonstrate this, let's use a dataset that contains a Date feature.

In [None]:
new_train = pd.read_csv("sendy_train.csv")
new_train.head(3).T

The dataset is a logistics dataset and contains lots of time features which we can analyse. Let's demonstrate how easy it is to extract information from the __Placement - Time__ and __Arrival at Destination - Time__ features using the extract_dates function.

In [None]:
cols = ['Placement - Time', 'Arrival at Destination - Time']
df = ds.timeseries.extract_dates(new_train, date_cols=cols)
df.head(3).T

Note: You can specify the features to return by changing the subset parameter. For instance, we could specify that we only want day of the week and hour.

In [None]:
cols = ['Placement - Time', 'Arrival at Destination - Time']
df = ds.timeseries.extract_dates(new_train, date_cols=cols, subset=['dow', 'hr'])
df.head(3).T
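
In plain pandas, this kind of extraction is done with the `.dt` accessor. The sketch below assumes the column holds full timestamps in `YYYY-MM-DD HH:MM:SS` format, which may differ from the actual file, and the `_dow` and `_hr` column suffixes are hypothetical naming choices, not necessarily what datasist produces.

```python
import pandas as pd


def extract_date_parts(df, date_col):
    """Add day-of-week and hour columns derived from a timestamp column."""
    out = df.copy()
    stamps = pd.to_datetime(out[date_col], format="%Y-%m-%d %H:%M:%S")
    out[date_col + "_dow"] = stamps.dt.dayofweek  # Monday is 0
    out[date_col + "_hr"] = stamps.dt.hour
    return out


# Hypothetical toy data
toy = pd.DataFrame({
    "Placement - Time": ["2019-11-04 09:35:46", "2019-11-09 17:05:10"],
})

parts = extract_date_parts(toy, "Placement - Time")
print(parts)
```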

2. timeplot: The timeplot function can help you visualize a set of features against a particular time feature. This can help you identify trends and patterns in these features. To use this function, you pass a set of numerical columns, and then specify the Date feature you want to plot against.

In [None]:
num_cols = ['Time from Pickup to Arrival', 'Destination Long', 'Pickup Long','Platform Type', 'Temperature']
ds.timeseries.timeplot(new_train, num_cols=num_cols,
                       time_col='Placement - Time')

In [None]:
num_cols = ['Time from Pickup to Arrival', 'Destination Long', 'Pickup Long','Platform Type', 'Temperature']
ds.timeseries.timeplot(new_train, num_cols=num_cols,
                       time_col='Pickup - Time')

______________________________________________
### Easy visualization using datasist.

The visualization module is one of the strong areas of datasist. There are lots of functions that you can use to create aesthetic and colorful plots with minimal code. In this post, I'll highlight some of the most important functions available in this module.

Note: All functions in the visualization module work at data scale, not feature scale. This means you can pass in the full dataset and it visualizes every feature out of the box. You can also specify the features you want to plot.

#### VISUALIZATION FOR CATEGORICAL FEATURES

Visualizations for categorical features include plots like box plots, bar plots and count plots. We can use the functions available in datasist to easily do this at a data-wide level.

1. boxplot: This function makes a box plot of all numerical features against a specified categorical target column. 

Note: You can save a plot as a png file in the current folder by setting the save_fig parameter to True in any of the visualization functions.

In [None]:
ds.visualizations.boxplot(train_df, target='Claim')

2. catbox: The catbox function is used to make a side-by-side bar plot of all categorical features in a dataset against a specified categorical target. This can help in identifying patterns, as well as features that separate the specified target properly.

Note: catbox will only plot categorical features with a limited number of unique classes. Also, the target must be a categorical feature with a limited number of unique classes.

In [None]:
ds.visualizations.catbox(train_df, target='Claim')

3. countplot: The countplot function simply makes a bar plot of all categorical features to show their class counts.

Note: You can specify the features to plot; otherwise, they are automatically inferred. You can also specify a separate_by feature.

In [None]:
ds.visualizations.countplot(train_df)

In [None]:
ds.visualizations.countplot(train_df, separate_by='Claim')

#### VISUALIZATION FOR NUMERICAL FEATURES

Visualizations for numerical features include plots like scatter plots, histograms, KDE plots, etc. We can use the functions available in datasist to easily do this at a data-wide level.

1. histogram: This function makes a histogram plot of all numerical features in a dataset. This helps to show the distribution of the features.

Note: To use this, the specified features to plot must not contain missing values, else it would throw an error.

In our example below, the features Building Dimension and Date_of_Occupancy both contain missing values. We could decide to fill these before plotting, or we could pass in a list with these features removed.

I'll go with the first option, that is filling the missing values.

In [None]:
df = ds.feature_engineering.fill_missing_num(train_df)
ds.visualizations.histogram(df)

2. scatterplot: This function makes a scatter plot of all numerical features in a dataset against a numerical target. It helps to show the correlation between features.

In [None]:
feats = ['Insured_Period',
         'Residential',
         'Building Dimension',
         'Building_Type',
         'Date_of_Occupancy']

ds.visualizations.scatterplot(train_df, num_features=feats, target='Building Dimension')

3. plot_missing: As the name implies, this function can be used to visualize the missing values in a dataset. White cells indicate missing values and dark cells indicate filled values. The color range at the right-hand corner shows intensity values.

In [None]:
ds.visualizations.plot_missing(train_df)

____________________________________________________________________
#### Testing and comparing machine learning models with datasist

The __model__ module contains functions and methods for testing and comparing machine learning models. The current version of datasist only supports scikit-learn models. TensorFlow and PyTorch models will be supported soon.
I'll highlight some of the important functions in this module, and also show you how you can use the metrics visualization functions in the visualization module alongside them.

To demonstrate these functions, we'll use a dataset from the Data Science Nigeria 2019 BootCamp, available here. The task is to predict insurance claims (1=Claim, 0=No Claim) from building observations.
We'll do some basic data preprocessing and prepare the data for modeling.
Note: The goal of this analysis is to demonstrate how to use the model module, so we will not be doing any heavy feature engineering.

In [None]:
pd.set_option('display.max_colwidth', 400)
train = pd.read_csv('train_data.csv')
test = pd.read_csv('test_data.csv')
vardef = pd.read_csv("variabledef.csv")

In [None]:
vardef

In [None]:
#drop the id column
train.drop(columns='Customer Id', inplace=True)
test.drop(columns='Customer Id', inplace=True)

#fill missing values
train = ds.feature_engineering.fill_missing_cats(train)
train = ds.feature_engineering.fill_missing_num(train, method='mean')

test = ds.feature_engineering.fill_missing_cats(test)
test = ds.feature_engineering.fill_missing_num(test, method='mean')

ds.structdata.display_missing(train)

Now we have a properly filled dataset. Next, we'll encode all categorical features using either label encoding or one-hot encoding, depending on the number of unique classes.

In [None]:
#check the unique classes in each categorical feature
ds.structdata.class_count(train)

We will label encode Geo_Code, since its number of unique classes is large, and one-hot encode the rest.

In [None]:
import category_encoders as ce

# drop target column
target = train['Claim'].values
train.drop(columns='Claim', inplace=True)

enc = ce.OrdinalEncoder(cols=['Geo_Code'])
enc.fit(train)
train_enc = enc.transform(train)
test_enc = enc.transform(test)


#one-hot-encode the rest categorical features
hot_enc = ce.OneHotEncoder()
hot_enc.fit(train_enc)
train_enc = hot_enc.transform(train_enc)
test_enc = hot_enc.transform(test_enc)

In [None]:
train_enc.head()

In [None]:
print("Shape of train data after encoding: {}".format(train_enc.shape))
print("Shape of test data after encoding: {}".format(test_enc.shape))
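
If you do not have category_encoders installed, the same two encoding schemes can be sketched with pandas alone. The example below is an illustration on hypothetical toy data, using `pd.factorize` for label encoding and `pd.get_dummies` for one-hot encoding; the exact codes and column names will differ from what category_encoders produces.

```python
import pandas as pd

# Hypothetical toy data: one high-cardinality and one low-cardinality feature
toy = pd.DataFrame({
    "Geo_Code": ["1053", "1143", "1053", "1160"],
    "Garden": ["V", "O", "V", "O"],
})

encoded = toy.copy()

# Label encode: each unique class gets an integer code, starting from 0
codes, classes = pd.factorize(encoded["Geo_Code"])
encoded["Geo_Code"] = codes

# One-hot encode: one indicator column per class
encoded = pd.get_dummies(encoded, columns=["Garden"])
print(encoded)
```

Unlike a fitted encoder, this sketch has no separate `fit` and `transform` steps, so to encode train and test consistently you would apply it to the combined data.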

1. compare_model: This function takes multiple machine learning models as arguments and returns a plot of the chosen metric. It can be used to pick a base model and also to compare models side by side. The compare_model function returns a tuple of the trained models and their scores, in case you want to make predictions with the best model.

Now let's compare some classification models. We'll compare RandomForest, LightGBM and XGBoost models.

Note: We won't be performing any advanced hyperparameter tuning in this session, as the goal is to show you how to use the functions, not to tune extensively.

Also, you will have to install lightgbm and xgboost before you can try this part. Alternatively, you can use the default models in scikit-learn. To install lightgbm go here and to install xgboost, go here.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import xgboost as xgb

Xtrain, Xtest, ytrain, ytest = train_test_split(train_enc, target, test_size=0.3, random_state=1)
rf_classifier = RandomForestClassifier(n_estimators=20, max_depth=4)
lgb_classifier = lgb.LGBMClassifier(n_estimators=20, max_depth=4)
xgb_classifier = xgb.XGBClassifier(n_estimators=20, max_depth=4)

In [None]:
classifiers = [rf_classifier, lgb_classifier, xgb_classifier]
models, scores = ds.model.compare_model(models_list=classifiers, x_train=Xtrain, y_train=ytrain, scoring_metric='accuracy')

From this sample analysis, the LGBMClassifier is currently the best model. We can make predictions using this model.

In [None]:
pred = models[1].predict(Xtest)
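
Conceptually, comparing models like this boils down to scoring each candidate with the same metric and keeping the best one. The sketch below does that with scikit-learn's `cross_val_score` on synthetic data. Whether compare_model works exactly this way internally is an assumption, and the two models used here are stand-ins chosen so that the example needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

candidates = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=4, random_state=1),
]

cv_means = []
for candidate in candidates:
    # Mean accuracy over 3 cross-validation folds
    fold_scores = cross_val_score(candidate, X, y, cv=3, scoring="accuracy")
    cv_means.append(fold_scores.mean())
    candidate.fit(X, y)  # refit on all the data so the model can predict

best_model = candidates[cv_means.index(max(cv_means))]
print(type(best_model).__name__, max(cv_means))
```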

2. get_classification_report: We can get a detailed metric report for a classification task using the get_classification_report function. It accepts the predicted classes and the true values as arguments.

In [None]:
ds.model.get_classification_report(pred, ytest)

3. plot_feature_importance: This function will make a bar plot of the most important features of a trained model. 

In [None]:
model = models[1]  #get a model from the list of returned models
features = train_enc.columns  #get the feature names from the processed data

ds.model.plot_feature_importance(model, features)

Note: We demonstrated the examples using a classification task. You can also apply the same functions to your regression problems. Other functions available in the model module are train_classifier and make_submission_file.

_____________________________________________________________
[LINK](https://github.com/risenW/datasist) TO DATASIST REPO ON GITHUB

[LINK](https://risenw.github.io/datasist/index.html) TO API DOCUMENTATION

THANK YOU FOR LISTENING