In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left"> Table of Contents </h1>

#### 1) Introduction

#### 2) Load Required Libraries

#### 3) Read Data

#### 4) SweetViz (AutoEDA)

>       Method 1: To visualize in html format

>       Method 2: To visualize in kaggle notebook

>       Method 3: To visualize in Google Colab notebook

>       Method 4: Split and Compare (Dataset comparisons)

>       Method 5: Target Analysis

>       Method 6: Skip Variables

>       Method 7: Comparing categories within a column - Such as Sex, Embarked and Pclass

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 1) Introduction </h1>

**1) Exploratory Data Analysis (EDA)** is the process of analyzing a dataset and summarizing its main characteristics, often using visual methods. EDA is important because if you are not familiar with the dataset you are working on, you will not be able to draw reliable inferences from it. However, EDA generally takes a lot of time.

>    In this notebook, we will work on **automating EDA using Sweetviz**, a Python library that generates beautiful, high-density visualizations to start your EDA. Let us explore Sweetviz in detail.

>    **Pandas Profiling** can run out of memory when the dataset has many features (for example, the Advanced Housing Price dataset). In such cases **Sweetviz** tends to cope much better than Pandas Profiling.


#### **2) Sweetviz 2.0**

 - Sweetviz is an open-source, pandas-based Python library that performs the primary EDA tasks without much hassle, with **just two lines of code:** one to analyze the data and one to show the report.

 - It generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis). The output is a fully self-contained HTML application.

 - The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.

>    **Target analysis:** shows how a target value (e.g. "Survived" in the Titanic dataset) relates to other features

>    **Dataset comparisons:** between datasets (e.g. "Train vs Test") and intra-set (e.g. "Male vs Female")

>    **Correlation/associations:** full integration of numerical and categorical data correlations and associations, all in one graph and table

>    **Visualize and compare:**

       - Distinct datasets (e.g. training vs test data)
       
       - Intra-set characteristics (e.g. male versus female)

>    **Mixed-type associations:**

       - Sweetviz integrates associations for the following type combinations, to provide maximum information for all data types:
       
         - numerical (Pearson's correlation)
         
         - categorical (uncertainty coefficient)
         
         - categorical-numerical (correlation ratio)
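
The three association measures listed above can be written out in a few lines of plain Python. This is a minimal sketch for intuition only: the function names are hypothetical, the toy values are made up, and Sweetviz's own implementation may differ in details such as the handling of missing values.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson's correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def uncertainty_coefficient(x, y):
    """Theil's U(x|y): share of the uncertainty in x that y explains (asymmetric)."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so nothing is left to explain
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    # Conditional entropy H(x|y): entropy of x inside each y-group, weighted by group size
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in groups.values())
    return (h_x - h_x_given_y) / h_x


def correlation_ratio(categories, values):
    """Correlation ratio (eta): share of numeric spread that lies between categories."""
    grand_mean = sum(values) / len(values)
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0


# Made-up toy columns, loosely shaped like Titanic features
age = [22, 38, 26, 35, 54, 2]
fare = [7.25, 71.28, 7.93, 53.10, 51.86, 21.08]
sex = ["m", "f", "f", "f", "m", "m"]
pclass = ["3", "1", "3", "1", "1", "3"]

print("numerical vs numerical:", pearson(age, fare))
print("categorical vs categorical:", uncertainty_coefficient(pclass, sex))
print("categorical vs numerical:", correlation_ratio(pclass, fare))
```

Note that the uncertainty coefficient is asymmetric: `uncertainty_coefficient(x, y)` and `uncertainty_coefficient(y, x)` generally differ, which is why an association matrix built from it is not symmetric.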
       
>    **Type inference:** automatically detects numerical, categorical and text features, with optional manual overrides

>    **Summary information:**

      - Type, unique values, missing values, duplicate rows, most frequent values
      
      - Numerical analysis: min/max/range, quartiles, mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
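
To make the numerical summary concrete, here is a minimal sketch that computes the same quantities with the Python standard library. The function name `numeric_summary` is hypothetical, and the conventions chosen here (population standard deviation, excess kurtosis, default quartile method) are assumptions that may not match the figures Sweetviz reports.

```python
import math
import statistics


def numeric_summary(values):
    """Summary statistics for a list of at least two numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed convention)
    q1, q2, q3 = statistics.quantiles(values, n=4)
    # Central moments used for skewness and kurtosis
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "quartiles": (q1, q2, q3),
        "mean": mean,
        "mode": statistics.mode(values),
        "std": std,
        "sum": sum(values),
        # Median absolute deviation: median distance from the median
        "mad": statistics.median([abs(v - median) for v in values]),
        # Coefficient of variation: spread relative to the mean
        "cv": std / mean if mean else math.nan,
        "skewness": m3 / m2 ** 1.5 if m2 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3 if m2 else 0.0,  # excess kurtosis
    }


summary = numeric_summary([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(f"{name}: {value}")
```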
      
#### There are 3 main functions for creating reports:

   - **analyze(…)**

   - **compare(…)**
       
   - **compare_intra(…)**

In [None]:
from IPython.display import Image

In [None]:
Image('../input/imagespandas-profiling/sweetviz_05.png',width=500,height=500)

In [None]:
Image('../input/imagespandas-profiling/sweetviz_06.png',width=800,height=800)

<h2 style=color:blue align="left"> Creating a report: </h2>

<h3 style=color:green align="left"> Step 1: Installing Sweetviz </h3>

        pip install sweetviz

<h3 style=color:green align="left"> Step 2: Load the pandas dataframe(s) </h3>

       import sweetviz
 
       import pandas as pd
 
       train = pd.read_csv("/kaggle/input/mobile-price-classification/train.csv")
 
       test = pd.read_csv("/kaggle/input/mobile-price-classification/test.csv")
 
 
<h3 style=color:green align="left"> Step 3: create the report </h3>

 - **analyze()** for a single dataset (it analyzes the whole dataset and produces a detailed report with visualizations)

 - **compare()** to compare 2 datasets (e.g. Test versus Train)

 - **compare_intra()** to compare 2 sub-populations within the same dataset
 
 
<h3 style=color:green align="left"> Step 4: generate output </h3>

 - report.show_html()
 
 - With the default options, this will create a file **"SWEETVIZ_REPORT.html"** and pop open a browser. If you are operating inside a notebook, the file will still be generated but the browser may not pop up (using **show_notebook()** is recommended for notebooks, see documentation).

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 2) Load Required Libraries </h1>

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 3) Read Data </h1>

In [None]:
mobile_train = pd.read_csv("/kaggle/input/mobile-price-classification/train.csv")
mobile_test = pd.read_csv("/kaggle/input/mobile-price-classification/test.csv")

titanic_train = pd.read_csv("/kaggle/input/titanic/train.csv")
titanic_test = pd.read_csv("/kaggle/input/titanic/test.csv")

netflix = pd.read_csv("/kaggle/input/netflix-shows/netflix_titles.csv")

In [None]:
display(mobile_train.head(3))
display(mobile_test.head(3))

In [None]:
display(titanic_train.head(3))
display(titanic_test.head(3))

In [None]:
display(netflix.head(3))

In [None]:
print('Size of mobile price train dataset:', mobile_train.shape)
print('\nSize of mobile price test dataset:', mobile_test.shape)
print('\nSize of titanic train dataset:', titanic_train.shape)
print('\nSize of titanic test dataset:', titanic_test.shape)
print('\nSize of netflix dataset:', netflix.shape)

In [None]:
print('Missing Values in mobile price train:\n\n', mobile_train.isnull().sum())
print('\n\nMissing Values in mobile price test:\n\n', mobile_test.isnull().sum())
print('\n\nMissing Values in titanic train:\n\n', titanic_train.isnull().sum())
print('\n\nMissing Values in titanic test:\n\n', titanic_test.isnull().sum())
print('\n\nMissing Values in netflix:\n\n', netflix.isnull().sum())

In [None]:
# info() prints its summary directly and returns None, so call it outside print()
for name, df in [('Mobile Price Classification', mobile_train), ('Titanic', titanic_train), ('Netflix', netflix)]:
    print(f'\n\n{name}:\n')
    df.info()

In [None]:
mobile_train['price_range'].value_counts()

In [None]:
mobile_train['wifi'].value_counts()

In [None]:
netflix['type'].value_counts()

netflix['type'] = netflix['type'].map({'Movie':0, 'TV Show':1})

In [None]:
netflix['rating'].value_counts()

In [None]:
netflix['rating'] = netflix['rating'].fillna(netflix['rating'].mode()[0])

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 4) SweetViz (AutoEDA) </h1>

In [None]:
!pip install sweetviz

In [None]:
import sweetviz as sv

<h2 style=color:blue align="left"> Generate the Profiling Report in seven ways </h2>

--------------------

<h3 style=color:green align="left"> Method 1: To visualize in html format </h3>


        import sweetviz as sv
        my_report = sv.analyze(df)
        my_report.show_html()        # Default arguments will write to "SWEETVIZ_REPORT.html"

--------------------------

<h3 style=color:green align="left"> Method 2: To visualize in kaggle notebook </h3>

        import sweetviz as sv
        my_report = sv.analyze(df)
        my_report.show_notebook(w="100%", h="full")      # if working in Kaggle

------------------------

<h3 style=color:green align="left"> Method 3: To visualize in Google Colab notebook </h3>

        import sweetviz as sv
        my_report = sv.analyze(df)
        my_report.show_notebook() # if working in Colab

-----------------------

<h3 style=color:green align="left"> Method 4: Split and Compare (Dataset comparisons) </h3>

#### a) Comparison of a single (train) DataFrame by split
#### b) Comparison of two DataFrames (train and test)

----------------------

<h3 style=color:green align="left"> Method 5: Target Analysis </h3>

#### a) Analyze single (train) Dataframe wrt Target feature
#### b) Compare two Dataframes (train and test) wrt Target feature

---------------------

<h3 style=color:green align="left"> Method 6: Skip Variables </h3>

---------------------

<h3 style=color:green align="left"> Method 7: Comparing categories within a column - Such as Sex, Embarked and Pclass </h3>

--------------------

<h1 style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Method 1: To visualize in html format </h1>

In [None]:
# Analyzing data
my_report = sv.analyze(mobile_train)

# Generating report
# Default name is SWEETVIZ_REPORT.html
my_report.show_html('EDA_Report.html', open_browser=False)

#### To open the generated **EDA_Report.html** file, follow these steps:

 - In the top-right corner, expand **Add data**
 
 - Look under **output / kaggle/working / EDA_Report.html**

In [None]:
import IPython
IPython.display.HTML("EDA_Report.html")

<h1 style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Method 2: To visualize in kaggle notebook </h1>

In [None]:
my_report1 = sv.analyze(mobile_train)
my_report1.show_notebook(w="100%", h="full")

<h1 style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Method 3: To visualize in Google Colab notebook </h1>

In [None]:
my_report2 = sv.analyze(mobile_train)
my_report2.show_notebook()

In [None]:
my_report3 = sv.analyze(netflix)
my_report3.show_notebook()

In [None]:
my_report3.show_html('EDA_Report_Netflix.html', open_browser=False)
IPython.display.HTML("EDA_Report_Netflix.html")

<h1 style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Method 4: Split and Compare (Dataset comparisons) </h1>

- Sweetviz can also visualize a **comparison of test and train data.** To demonstrate this, let us divide the data into 2 parts: **the first 1000 rows as the train dataset and the remaining 1000 rows as the test dataset.**

- The **compare()** function of Sweetviz compares two datasets. The commands given below create the two subsets and compare them.
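
As a rough illustration of what a split-and-compare looks at, the sketch below splits a list of records by row position and puts one feature's mean and missing share side by side. It is plain Python with hypothetical helper names and made-up values; it is not how Sweetviz computes its report, only the kind of side-by-side figure such a report surfaces.

```python
def split_rows(rows, n_first):
    """Split records into the first n_first rows and the remainder."""
    return rows[:n_first], rows[n_first:]


def compare_feature(base, other, feature):
    """Return the mean and missing share of one numeric feature in each subset."""
    def describe(rows):
        present = [r[feature] for r in rows if r.get(feature) is not None]
        mean = sum(present) / len(present) if present else None
        missing = 1 - len(present) / len(rows) if rows else None
        return {"mean": mean, "missing": missing}

    return {"base": describe(base), "compared": describe(other)}


# Made-up records with one numeric feature and one missing value
rows = [{"ram": 1}, {"ram": 3}, {"ram": None}, {"ram": 5}]
first_half, second_half = split_rows(rows, 2)
print(compare_feature(first_half, second_half, "ram"))
```

A large gap between the two subsets on the same feature is the signal to look for: it suggests the split (or the train/test pair) is not drawn from the same distribution.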

<h1 style="background-color:yellow; font-family:newtimeroman; font-size:180%; text-align:left;"> a) Comparison of a single (train) DataFrame by split </h1>

### Mobile Price Classification

In [None]:
# Splitting the data into two datasets
data1 = mobile_train[0:1000]
data2 = mobile_train[1000:]

In [None]:
report_comp1 = sv.compare([data1,'DATA1'],[data2,'DATA2']) 
report_comp1.show_notebook()

In [None]:
report_comp1.show_html(filepath = 'report.html', open_browser=True, layout = 'vertical', scale=0.7)

In [None]:
report_comp1.show_notebook(w=None, h=None, scale=None, layout='vertical')

### Netflix

In [None]:
X = netflix.drop(['type','show_id'], axis=1)
y = netflix['type']

In [None]:
# Data split using 80/20 split ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.2)

In [None]:
X_train.head(3)

In [None]:
X_test.head(3)

In [None]:
# Comparison report
report_comp1 = sv.compare([X_train, 'Train'],[X_test,'Test']) 
report_comp1.show_notebook()

<h1 style="background-color:yellow; font-family:newtimeroman; font-size:180%; text-align:left;"> b) Comparison of two DataFrames (train and test) </h1>

- To compare two data sets, simply use the compare() function. Its parameters are the same as analyze(), except with an inserted second parameter to cover the comparison dataframe. It is recommended to use the [dataframe, "name"] format of parameters to better differentiate between the base and compared dataframes. (e.g. [my_df, "Train"] vs my_df)

             my_report = sv.compare([my_dataframe, "Training Data"], [test_df, "Test Data"], "Survived", feature_config)

In [None]:
my_comp2 = sv.compare([mobile_train, 'Train'], [mobile_test, "Test"])
my_comp2.show_notebook()

<h1 style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Method 5: Target Analysis </h1>

- We can also perform **target analysis**, but currently it **supports only numerical or binary targets,** not categorical targets. Let's consider **Survived** from the Titanic dataset as the target:

<h1 style="background-color:yellow; font-family:newtimeroman; font-size:190%; text-align:left;"> a) Analyze single (train) Dataframe wrt Target feature </h1>

        my_report3 = sv.analyze(titanic_train, "Survived")
        
        my_report3.show_notebook()
        
        
        
        my_report4 = sv.analyze([titanic_train, 'Train'], target_feat='Survived')
        
        my_report4.show_notebook()

In [None]:
my_report4 = sv.analyze([titanic_train, 'Train'], target_feat='Survived')
my_report4.show_notebook()

<h1 style="background-color:yellow; font-family:newtimeroman; font-size:190%; text-align:left;"> b) Compare two Dataframes (train and test) wrt Target feature </h1>

     my_comp3 = sv.compare(titanic_train,titanic_test,'Survived')

     my_comp3.show_notebook()
     
     

     my_comp4 = sv.compare([titanic_train, "Train"], [titanic_test, "Test"], target_feat='Survived')

     my_comp4.show_notebook()

In [None]:
my_comp4 = sv.compare([titanic_train, "Train"], [titanic_test, "Test"], target_feat='Survived')
my_comp4.show_notebook()
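
Because target analysis accepts only numerical or binary targets, a multi-class categorical column has to be recoded before it can serve as a target. One possible workaround is a one-vs-rest recoding, sketched below in plain Python. The helper name `one_vs_rest` and the sample values are hypothetical, and whether this recoding is meaningful depends on the question you are asking of the data.

```python
def one_vs_rest(values, positive):
    """Recode a categorical column as 1 for `positive` and 0 for anything else.

    Missing values (None) are kept as None so they are not silently counted as 0.
    """
    return [None if v is None else int(v == positive) for v in values]


# Made-up sample of a three-class column such as Embarked
embarked = ["S", "C", "Q", None, "S"]
embarked_is_s = one_vs_rest(embarked, "S")
print(embarked_is_s)
```

The recoded list can then be stored as a new column and passed as the target feature, at the cost of collapsing all other classes into a single "rest" group.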

<h1 style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Method 6: Skip Variables </h1>

In [None]:
# Features that carry no analytical value, such as S.No / PassengerId / Name / Id, can generally be skipped.
# force_cat: features to treat as categorical even though they are stored as integers (e.g. Pclass)

feature_config = sv.FeatureConfig(skip=['PassengerId', 'Name'], force_cat=['Ticket', 'Pclass'])
my_comp5 = sv.compare([titanic_train, 'Train'], [titanic_test, 'Test'], 'Survived', feature_config)
my_comp5.show_notebook()

In [None]:
feature_config1 = sv.FeatureConfig(skip=['show_id', 'description'], force_cat=['director', 'release_year', 'type', 'rating'])

<h1 style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Method 7: Comparing categories within a column - Such as Sex, Embarked and Pclass </h1>

### Comparing two subsets of the same dataframe (e.g. Male vs Female)
- Another way to get great insights is to use the comparison functionality to **split your dataset into 2 sub-populations.**

- Support for this is built in through the **compare_intra()** function. This function takes a boolean series as one of the arguments, as well as an explicit "name" tuple for naming the **(true, false)** resulting datasets. Note that internally, this **creates 2 separate dataframes** to represent each resulting group. As such, it is more of a shorthand function of doing such processing manually.

             my_report = sv.compare_intra(my_dataframe, my_dataframe["Sex"] == "male", ["Male", "Female"], feat_cfg=feature_config)
             

In [None]:
my_comp6 = sv.compare_intra(titanic_train, titanic_train["Sex"] == 'male', ['Male', 'Female'], 'Survived', feature_config)
my_comp6.show_notebook()
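
Since compare_intra() is described as shorthand for building the two groups by hand, the partitioning step can be sketched in plain Python as follows. The helper `split_by_condition` and the sample records are hypothetical; with pandas the same split would typically be done with boolean indexing on the dataframe.

```python
def split_by_condition(rows, condition, names):
    """Partition records into two named groups: (condition true, condition false)."""
    true_name, false_name = names
    groups = {true_name: [], false_name: []}
    for row, flag in zip(rows, condition):
        groups[true_name if flag else false_name].append(row)
    return groups


# Made-up passenger records
passengers = [
    {"Name": "A", "Sex": "male"},
    {"Name": "B", "Sex": "female"},
    {"Name": "C", "Sex": "male"},
]
is_male = [p["Sex"] == "male" for p in passengers]
groups = split_by_condition(passengers, is_male, ("Male", "Female"))
print({name: len(members) for name, members in groups.items()})
```

One design point worth noting: the second group is simply every row where the condition is false. Labelling it "Female" is only accurate when the column has exactly two values and no missing entries; otherwise a name like "Not male" describes the group more honestly.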

<h1 style="background-color:orange; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Method 8: Optional arguments </h1>

### 8.1) pairwise_analysis:
- Correlations and other associations can take quadratic time (n^2) to complete. The default setting ("auto") will run without warning until a data set contains "association_auto_threshold" features. Past that threshold, you need to explicitly pass the parameter pairwise_analysis="on" (or ="off") since processing that many features would take a long time. This parameter also covers the generation of the association graphs (based on Drazen Zaric's concept):

In [None]:
my_report5 = sv.analyze(netflix, pairwise_analysis="off")
my_report5.show_notebook()
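
To see why pairwise analysis becomes expensive, it helps to count the feature pairs involved. The short sketch below is plain Python with a hypothetical helper name; it counts pairs only and says nothing about how long Sweetviz takes per pair.

```python
from itertools import combinations


def pair_counts(n_features):
    """Return (unordered pairs, ordered pairs) among n_features columns."""
    unordered = n_features * (n_features - 1) // 2
    ordered = n_features * (n_features - 1)  # needed for asymmetric measures
    return unordered, ordered


# Enumerating the pairs for a handful of column names
features = ["type", "title", "director", "country"]
for left, right in combinations(features, 2):
    print(left, "<->", right)

# The count grows quadratically with the number of features
for n in (10, 50, 100, 500):
    print(n, "features ->", pair_counts(n))
```

Doubling the number of features roughly quadruples the number of pairs, which is why turning pairwise analysis off is a reasonable choice for wide datasets.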

### 8.2) feat_cfg:
- A FeatureConfig object representing features to be **skipped, or to be forced** a certain type in the analysis. The arguments can either be a single string or list of strings. Parameters are **skip, force_cat, force_num and force_text.** The "force_" arguments override the built-in type detection. They can be constructed as follows:

           feature_config = sv.FeatureConfig(skip="PassengerId", force_text=["Age"])
           
           
#### a) skip and force_cat --> refer **Method 6: Skip Variables**

In [None]:
feature_config3 = sv.FeatureConfig(skip="show_id", force_text=["release_year"])
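
The interplay between automatic type detection and the skip / force_ overrides can be illustrated with a toy resolver. Everything here is an assumption made for illustration: the function names, the `max_categories` threshold and the detection rule are hypothetical and are not Sweetviz's actual inference logic. Only the idea that overrides take precedence over detection comes from the description above.

```python
def _as_list(names):
    """Accept either a single string or an iterable of strings."""
    return [names] if isinstance(names, str) else list(names)


def infer_type(values, max_categories=10):
    """Toy rule: few distinct values -> categorical, else numerical or text."""
    present = [v for v in values if v is not None]
    if len(set(present)) <= max_categories:
        return "categorical"
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "numerical" if is_number else "text"


def resolve_types(columns, skip=(), force_cat=(), force_num=(), force_text=()):
    """Apply skip and force_* overrides first, then fall back to inference."""
    skipped = set(_as_list(skip))
    forced = {}
    for kind, names in (
        ("categorical", force_cat),
        ("numerical", force_num),
        ("text", force_text),
    ):
        for name in _as_list(names):
            forced[name] = kind
    return {
        name: forced.get(name) or infer_type(values)
        for name, values in columns.items()
        if name not in skipped
    }


# Made-up columns shaped like the Netflix data
columns = {
    "show_id": [f"s{i}" for i in range(50)],
    "release_year": list(range(1970, 2020)),
    "type": ["Movie", "TV Show"] * 25,
}
print(resolve_types(columns, skip="show_id", force_text=["release_year"]))
```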
