# PYTHON EDA LIBRARIES FOR LARGE DATASETS (GOOGLE COLAB)

> This notebook applies 4 different Python EDA packages (Pandas Profiling, Sweetviz, Dtale and Autoviz) to the Kaggle Homesite Quote Conversion dataset (~200k rows and 300 columns)


## Download Data

Store your data in a Google Drive folder, then mount the drive to connect Colab to your Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import relevant libraries

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path

Copy the path of the folder where you store your data

In [4]:
path = Path('/content/drive/MyDrive/Kaggle/data/homesite-quote')
path.mkdir(parents=True, exist_ok=True)
path


PosixPath('/content/drive/MyDrive/Kaggle/data/homesite-quote')

Load the train and test data and store them as dataframes

In [5]:
df = pd.read_csv(path/'train_df.csv', low_memory=False)
test_df=pd.read_csv(path/'test.csv', low_memory=False)
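
Before handing a table this wide to an EDA library, it can help to confirm how large it actually is in memory. The sketch below is not part of the original notebook: `summarize_frame` is a hypothetical helper, and the small `demo` frame with invented values stands in for `df`.

```python
import numpy as np
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Return row/column counts, memory use in MB and the number of columns per dtype."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        # deep=True also counts the Python strings held in text columns
        "memory_mb": round(frame.memory_usage(deep=True).sum() / 1e6, 3),
        "dtype_counts": frame.dtypes.astype(str).value_counts().to_dict(),
    }


# Small stand-in frame; in the notebook you would call summarize_frame(df)
demo = pd.DataFrame({
    "QuoteNumber": np.arange(5),
    "Field7": [1.0, 2.0, np.nan, 4.0, 5.0],
    "Field6": list("ABCDE"),
})
summary = summarize_frame(demo)
print(summary)
```

A high share of text columns or a memory footprint close to the Colab RAM limit is a hint that the heavier reports may struggle.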

## EDA

> There are 4 EDA libraries that we are exploring today. Each has its own advantages and disadvantages. This page references https://towardsdatascience.com/4-libraries-that-can-perform-eda-in-one-line-of-python-code-b13938a06ae

### Pandas Profiling

> Please refer to https://github.com/pandas-profiling/pandas-profiling for the full instructions and examples

Run the code below to install Pandas Profiling straight from GitHub

In [None]:
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 


Import the library

In [7]:
from pandas_profiling import ProfileReport


Create the report. I set minimal=True to disable expensive computations and save runtime. This mode only gives a basic analysis with no correlation matrix (with around 300 columns, the correlation matrix would be too big to display anyway). I tried running the report in full mode, and it took far too long to load.

In [8]:
profile=ProfileReport(df, title="Homesite Quote Conversion",minimal=True)
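
If the full (non-minimal) report is too slow on the whole table, one possible workaround, which the original notebook did not try, is to profile a random sample of rows and a subset of columns. In this sketch `sample_for_profiling` is a hypothetical helper, the row and column limits are arbitrary, and a synthetic frame stands in for `df`.

```python
import numpy as np
import pandas as pd


def sample_for_profiling(frame, max_rows=20_000, max_cols=50, seed=0):
    """Return a random subset of rows and the first `max_cols` columns."""
    n_rows = min(max_rows, len(frame))
    subset = frame.sample(n=n_rows, random_state=seed)
    return subset.iloc[:, :max_cols]


# Stand-in for df: 1,000 rows x 80 numeric columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 80)),
                    columns=[f"Field{i}" for i in range(80)])
small = sample_for_profiling(demo, max_rows=200, max_cols=10)
print(small.shape)  # (200, 10)

# In the notebook, the reduced frame could then be profiled in full mode:
# profile_small = ProfileReport(sample_for_profiling(df), title="Homesite sample")
```

Taking the first columns is the simplest choice; passing an explicit list of columns of interest would work just as well.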

Run this code in Google Colab to show the report

In [9]:
%%time
profile.to_notebook_iframe()

Output hidden; open in https://colab.research.google.com to view.

Save an output file. This will create an HTML page in your Google Colab temporary folder. Download it or move it to Google Drive if you want to keep it

In [10]:
profile.to_file(output_file="Homesite_Quote_EDA_Data_Profilling.html")

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]
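
The report is written to Colab's temporary storage, so it has to be copied elsewhere to be kept. A minimal sketch using only the standard library: `copy_report` is a hypothetical helper, and temporary folders stand in for the Colab working directory and the mounted Drive folder (`path` in this notebook).

```python
import shutil
import tempfile
from pathlib import Path


def copy_report(report_file, target_dir):
    """Copy a generated report into target_dir and return the new path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(report_file, target_dir))


# Temporary folders play the role of the Colab working directory and Drive
work_dir = Path(tempfile.mkdtemp())
drive_dir = Path(tempfile.mkdtemp()) / "reports"
report = work_dir / "report.html"
report.write_text("<html></html>")

saved = copy_report(report, drive_dir)
print(saved.name)  # report.html

# In the notebook: copy_report("Homesite_Quote_EDA_Data_Profilling.html", path)
```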

A side note for people who are working on this dataset: looking at the descriptive analysis, many columns hold values from 1 to 25. These fields might have been encoded from categorical fields, but we don't know whether the codes follow a meaningful order.

![picture](https://drive.google.com/uc?id=1FSU9QiyL-R3LBZaP_YXbwaSgUOW5p_QF)
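
One way to list the columns that look like encoded categories is to keep the integer columns with only a handful of distinct values. This is a rough sketch on made-up data: `find_encoded_candidates` is a hypothetical helper, the threshold of 25 is taken from the observation above, and the check says nothing about whether the codes are ordered.

```python
import numpy as np
import pandas as pd


def find_encoded_candidates(frame, max_unique=25):
    """Return names of integer columns with at most `max_unique` distinct values."""
    int_cols = frame.select_dtypes(include=[np.integer]).columns
    n_unique = frame[int_cols].nunique()
    return n_unique[n_unique <= max_unique].index.tolist()


rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CoverageField1A": rng.integers(1, 26, size=500),  # codes between 1 and 25
    "SalesField8": np.arange(500),                     # 500 distinct values, not a code
    "Field8": rng.normal(size=500),                    # float column, ignored
})
candidates = find_encoded_candidates(demo)
print(candidates)  # ['CoverageField1A']
```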

We can also open the Warnings tab to see the potential issues flagged for each column

![picture](https://drive.google.com/uc?id=1qk1DzMjwxG9LDkQxxzx1_n3tX0QMPk5G)

### DTale
> D-Tale is the combination of a Flask back-end and a React front-end to bring you an easy way to view & analyze Pandas data structures.
Please refer to https://pypi.org/project/dtale/ for the full instructions and examples

Install the library in Google Colab

In [None]:
!pip install -U dtale

Set up the server for the app. You can use either USE_NGROK or USE_COLAB

In [12]:
import dtale
import dtale.app as dtale_app

#dtale_app.USE_NGROK=True

dtale_app.USE_COLAB = True




Show the report

In [13]:
%%time
dtale.show(df)

https://vq30jy6l36k-496ff2e9c6d22116-40000-colab.googleusercontent.com/dtale/main/1

CPU times: user 10.5 s, sys: 1.6 s, total: 12.1 s
Wall time: 12.7 s


Dtale is a very interactive app that offers rich information and many different graphs.

![picture](https://drive.google.com/uc?id=10U_lZBBMMT-YVRAzCCMrWB2_mcqDvQnn)

![picture](https://drive.google.com/uc?id=1C8yZGF5SYznn4XGuO06D7bPFOCgmmjQo)

This is also the only report that can generate a correlation matrix in a very short time: the cell above finished in 12.7 s. The layout is also flexible enough to show all the values of the correlation matrix. We can see that fields with similar names (for example CoverageFields1,2,3 etc.) have quite high Pearson correlation scores

![picture](https://drive.google.com/uc?id=1wLUsWq9qOuQpfr8YSre_f58QIxA8S5-B)
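
The same check can be reproduced in pandas for one family of columns at a time, which keeps the matrix small. This is a sketch on synthetic data: the column names mimic the dataset's `CoverageField` prefix, but the values are invented so that two of the columns correlate strongly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
demo = pd.DataFrame({
    # Two columns built from the same signal, so they correlate strongly
    "CoverageField1A": base + rng.normal(scale=0.1, size=1000),
    "CoverageField1B": base + rng.normal(scale=0.1, size=1000),
    "SalesField3": rng.normal(size=1000),  # unrelated column
})

# Keep only columns whose name contains the prefix, then compute Pearson correlation
coverage_corr = demo.filter(like="CoverageField").corr(method="pearson")
print(coverage_corr.round(2))
```

In the notebook, `df.filter(like="CoverageField").corr()` would give the equivalent matrix for the real columns, assuming they are numeric.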

### Sweetviz

>  Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. Output is a fully self-contained HTML application.
Please refer to https://pypi.org/project/sweetviz/ for the full instructions and examples.

An additional feature of Sweetviz is that it lets the user compare 2 datasets (for example, the train set and the test set)



Install the library

In [None]:
!pip install sweetviz

Reimport pandas and numpy, then import sweetviz. The layout of the app might be affected if you don't reimport pandas and numpy

In [15]:
import pandas as pd
import numpy as np
import sweetviz as sv


When we run the code, Sweetviz gives clear instructions on how to handle errors. The 3 columns below caused some issues, so I handled them before generating the report

In [16]:
df['PropertyField29']=df['PropertyField29'].fillna(-1)
test_df['PropertyField29']=test_df['PropertyField29'].fillna(-1)
df['PersonalField84']=df['PersonalField84'].fillna(-1)
test_df['PersonalField84']=test_df['PersonalField84'].fillna(-1)
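
To find which columns need this treatment and apply the same fill to both tables, a loop avoids repeating each line twice. This is a sketch on toy frames: `fill_missing` is a hypothetical helper, and -1 is the same sentinel used above.

```python
import numpy as np
import pandas as pd


def fill_missing(frames, fill_value=-1):
    """Fill NaN in every column that has missing values; return the affected column names."""
    affected = set()
    for frame in frames:
        cols = frame.columns[frame.isna().any()]
        if len(cols) > 0:
            frame[cols] = frame[cols].fillna(fill_value)
            affected.update(cols)
    return sorted(affected)


train_demo = pd.DataFrame({"PropertyField29": [0.0, np.nan, 1.0], "Field7": [1, 2, 3]})
test_demo = pd.DataFrame({"PropertyField29": [np.nan, 0.0, 0.0], "Field7": [4, 5, 6]})

filled_cols = fill_missing([train_demo, test_demo])
print(filled_cols)  # ['PropertyField29']
```

Note that this fills every column with missing values, which is broader than the two columns handled in the cell above.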

In [17]:
df['PropertyField37']=df['PropertyField37'].astype('bool')
test_df['PropertyField37']=test_df['PropertyField37'].astype('bool')
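
A caveat worth checking: if the column stores text flags such as 'Y' and 'N' (an assumption about the raw data, not something this notebook shows), `astype('bool')` turns every non-empty string into `True`, so the two values collapse into one. An explicit mapping avoids that. A sketch on a toy Series:

```python
import pandas as pd

# Toy stand-in for a text flag column; dtype=object keeps plain Python strings
flags = pd.Series(["Y", "N", "N", "Y"], dtype=object, name="PropertyField37")

# astype('bool') on strings: any non-empty string is truthy
naive = flags.astype("bool")
print(naive.tolist())   # [True, True, True, True]

# Explicit mapping keeps the distinction between the two codes
mapped = flags.map({"Y": True, "N": False})
print(mapped.tolist())  # [True, False, False, True]
```

If the column is already numeric (0/1), `astype('bool')` behaves as expected and no mapping is needed.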

Generate the report. 
We are giving names to each dataset (optional) and specifying a target feature (also optional). Specifying a target feature is extremely valuable, as the report will show how the target (here QuoteConversion_Flag) is affected by each variable

In [18]:
%%time
comparison_report = sv.compare([df,'Train'], [test_df,'Test'], target_feat='QuoteConversion_Flag',pairwise_analysis='off')

                                             |          | [  0%]   00:00 -> (? left)

CPU times: user 3min 44s, sys: 37.7 s, total: 4min 22s
Wall time: 3min 57s


Show the report. 
The report can be output as a standalone HTML file, OR embedded in this notebook. For notebooks, we can specify the width, height of the window, as well as scaling of the report itself. This line of code uses the default values (w="100%", h=750, layout="vertical")

In [19]:
comparison_report.show_notebook() 

Output hidden; open in https://colab.research.google.com to view.

Save an output file. This will create an HTML page in your Google Colab temporary folder. Download it or move it to Google Drive if you want to keep it

In [20]:
comparison_report.show_html(filepath='SWEETVIZ_REPORT_Full.html', 
            open_browser=True, 
            layout='widescreen')


Report SWEETVIZ_REPORT_Full.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


A side note for people who are working on this dataset: comparing the descriptive analysis of the train set and the test set, the two are strikingly similar. The distribution of values is almost identical for most columns, which suggests that the test set closely resembles the train set

![picture](https://drive.google.com/uc?id=1HFli-v23ivKZ84vqthfIZyjvKpUh4LDe)
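
That visual impression can be backed with a number per column, for example the total variation distance between the value frequencies in the two sets (0 means identical distributions, 1 means no overlap). This is a sketch for discrete columns on synthetic data: `frequency_distance` is a hypothetical helper and is not something Sweetviz itself reports.

```python
import numpy as np
import pandas as pd


def frequency_distance(train_col: pd.Series, test_col: pd.Series) -> float:
    """Total variation distance between the value frequencies of two discrete columns."""
    p = train_col.value_counts(normalize=True)
    q = test_col.value_counts(normalize=True)
    # Align on the union of observed values; a value missing on one side has frequency 0
    p, q = p.align(q, fill_value=0.0)
    return float(0.5 * (p - q).abs().sum())


rng = np.random.default_rng(0)
train_demo = pd.Series(rng.integers(1, 26, size=20_000))
test_demo = pd.Series(rng.integers(1, 26, size=20_000))  # same distribution
shifted = pd.Series(rng.integers(10, 35, size=20_000))   # different range

same_dist = frequency_distance(train_demo, test_demo)
diff_dist = frequency_distance(train_demo, shifted)
print(round(same_dist, 3))  # close to 0
print(round(diff_dist, 3))  # much larger
```

Applied column by column to `df` and `test_df`, small distances everywhere would support the observation that the two sets are drawn from the same distribution.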

### Autoviz

> Automatically Visualize any dataset, any size with a single line of code.
https://autoviz.io/

Install library

In [None]:
!pip install autoviz

Import

In [None]:
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

Download the dataset to your Google Colab temporary folder. This is because Autoviz only seems able to load a dataset from the current working directory

In [None]:
!wget https://media.githubusercontent.com/media/redditech/team-fast-tabulous/master/dataset/train_df.csv 

Run the report. This library doesn't seem to support large datasets with many variables. It threw some error messages after a few minutes

In [24]:
%%time
df_Au = AV.AutoViz('train_df.csv')


Output hidden; open in https://colab.research.google.com to view.
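
Since the run failed on the full table, one option (not tested in the original notebook) is to write a smaller CSV into the working directory and point Autoviz at that instead. The sketch below only prepares the file with pandas: `write_reduced_csv` is a hypothetical helper, the limits are arbitrary, and the commented Autoviz call has not been verified on the reduced file.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def write_reduced_csv(frame, out_path, max_rows=10_000, keep_cols=None, seed=0):
    """Write a random row sample (and optional column subset) of `frame` to CSV."""
    if keep_cols is not None:
        frame = frame[list(keep_cols)]
    sample = frame.sample(n=min(max_rows, len(frame)), random_state=seed)
    sample.to_csv(out_path, index=False)
    return Path(out_path)


# Stand-in for the training table
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 5, size=(500, 6)),
                    columns=[f"Field{i}" for i in range(6)])

out_file = Path(tempfile.mkdtemp()) / "train_small.csv"
write_reduced_csv(demo, out_file, max_rows=100, keep_cols=["Field0", "Field1", "Field2"])
reloaded = pd.read_csv(out_file)
print(reloaded.shape)  # (100, 3)

# In the notebook (unverified here): df_Au = AV.AutoViz('train_small.csv')
```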

## Conclusion

All of these EDA libraries help speed up your EDA process, especially if you are a beginner and don't want to spend too much time on data visualisation and analysis. 


In my opinion, Dtale performs best on large datasets: it produced detailed, complex analyses in the shortest time without raising errors. It is also quite interactive and lets you export code. 


Below is the runtime of each library: 
- Pandas Profiling (minimal version): 2 min 32 s
- Dtale: 12.7 s
- Sweetviz (comparing datasets): 3 min 57 s
- Autoviz: more than 8 min (doesn't seem to work with large datasets)
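
The timings above come from `%%time` in separate cells. To collect them in one place, a small standard-library helper can wrap each call. This is a sketch: `timed` is a hypothetical helper, and `sum(range(...))` is a stand-in workload where the notebook would call, for example, `dtale.show(df)`.

```python
import time


def timed(label, func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (label, result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return label, result, elapsed


# Stand-in workload; replace with the library call you want to time
label, result, seconds = timed("demo", sum, range(1_000_000))
print(f"{label}: {seconds:.3f} s")
```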
