<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

# CDF Capstone Project

---
# **Table of Contents**
---
**1.** [**Introduction**](#Section1)<br>

**2.** [**Problem Statement**](#Section2)<br>

**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)
  - **3.2** [**Upgrading Libraries**](#Section32)
  - **3.3** [**Importing Libraries**](#Section33)

**4.** [**Data Acquisition & Information**](#Section4)<br>
  - **4.1** [**Data Acquisition**](#Section41)
   - **4.1.1** [**Importing Events Dataset from CSV file**](#Section411)
   - **4.1.2** [**Importing Gender_age & Brand_model Dataset from MySQL**](#Section412)
  - **4.2** [**Data Information**](#Section42)
   - **4.2.1** [**Data Information for Events Dataset**](#Section421)
   - **4.2.2** [**Data Information for Gender_age Dataset**](#Section422)
   - **4.2.3** [**Data Information for Brand_model Dataset**](#Section423)

**5.** [**Data Pre-processing**](#Section5)<br>
  - **5.1** [**Filtering Events Dataset by States**](#Section51)
  - **5.2** [**Merging all three datasets**](#Section52)
  - **5.3** [**Pre-Profiling Report**](#Section51)
  - **5.4** [**Handling of Missing Data**](#Section52)<br>
  - **5.5** [**Feature Engineering**](#Section53)<br>
  - **5.6** [**Post Processing Report**](#Section54)<br>

**6.** [**Exploratory Data Analysis**](#Section6)<br>

**7.** [**Post Data Processing & Feature Selection**](#Section7)<br>
  - **7.1** [**Feature Selection**](#Section71)<br>
  - **7.2** [**Encoding the Categorical Data**](#Section72)<br>
  - **7.3** [**Data Preparation**](#Section73)<br>

**8.** [**Model Development & Evaluation**](#Section8)<br>
  - **8.1** [**ModelName - Baseline Model**](#Section81)<br>
  - **8.2** [**Using Trained Model for Prediction**](#Section82)<br>
  - **8.3** [**Model Evaluation**](#Section83)<br>

**9.** [**Summarization**](#Section9)<br>
  - **9.1** [**Conclusion**](#Section91)<br>
  - **9.2** [**Actionable Insights**](#Section92)<br>

---
<a name = Section1></a>
# **1. Introduction**
---



InsaidTelecom, one of the leading telecom players, understands that customizing its offerings is very important for its business to stay competitive.
Currently, InsaidTelecom is seeking to leverage behavioral data from more than 60% of the 50 million mobile devices active daily in India to help its clients better understand and interact with their audiences.

In this consulting assignment, Insaidians are expected to build a dashboard to understand users' demographic characteristics based on their mobile usage, geolocation, and mobile device properties.

Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts which are relevant to their users and catered to their preferences.

---
<a name = Section2></a>
# **2. Problem Statement**
---



---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [1]:
!pip install -q --user datascience                   
!pip install -q --user pandas-profiling              
!pip install -q --user yellowbrick                   
!pip install mysql-connector-python           

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
anaconda-project 0.9.1 requires ruamel-yaml, which is not installed.
pandas-profiling 3.1.0 requires markupsafe~=2.0.1, but you have markupsafe 1.1.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
anaconda-project 0.9.1 requires ruamel-yaml, which is not installed.
sphinx 4.0.1 requires MarkupSafe<2.0, but you have markupsafe 2.0.1 which is incompatible.




<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- After restarting the runtime, do not re-run the installation cell (3.1) or the upgrade cell below (3.2).

In [2]:
!pip install -q --upgrade pandas-profiling
!pip install -q --upgrade yellowbrick

<a name = Section33></a>
### **3.3 Importing Libraries**

In [3]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis) 
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high      
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.5f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (for Numerical Python)
from scipy.stats import randint as sp_randint                       # for initializing random integer values
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
from matplotlib.pylab import rcParams                               # Runtime configuration (rc) settings for matplotlib figures
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.metrics import accuracy_score                          # For calculating the accuracy for the model
from sklearn.metrics import precision_score                         # For calculating the Precision of the model
from sklearn.metrics import recall_score                            # For calculating the recall of the model
from sklearn.metrics import precision_recall_curve                  # For precision and recall metric estimation
from sklearn.metrics import confusion_matrix                        # For verifying model performance using confusion matrix
from sklearn.metrics import f1_score                                # For Checking the F1-Score of our model  
from sklearn.metrics import roc_curve                               # For Roc-Auc metric estimation
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split                # To split the data in training and testing part     
from sklearn.feature_selection import SelectFromModel               # To perform Feature Selection over model

#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings
#-------------------------------------------------------------------------------------------------------------------------------
import mysql.connector as connection

---
<a name = Section4></a>
# **4. Data Acquisition & Information**
---

<a name = Section41></a>
### **4.1 Data Acquisition**

- In this section we will read the datasets from the various sources available.

<a name = Section411></a>
#### **4.1.1 Importing 1st dataset - 'events_data' from a csv file**

- Whenever a user uses a mobile device on the INSAID Telecom network, an event is logged in this dataset. Each event has an event id, a device id, a location (latitude/longitude, city and state) and a timestamp that records when the device was used. The number of events per device reflects how frequently the device is used.

In [4]:
# Reading the data from the events data csv file
df_events_data = pd.read_csv('C:/Users/supadhyaya8/OneDrive - DXC Production/Documents/cdf/events_data.csv')
df_events_data.head()

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state
0,2765368,2.9733477869949143e+18,2016-05-07 22:52:05,77.22568,28.73014,Delhi,Delhi
1,2955066,4.734221357723753e+18,2016-05-01 20:44:16,88.38836,22.66033,Calcutta,WestBengal
2,605968,-3.264499652692493e+18,2016-05-02 14:23:04,77.25681,28.75791,Delhi,Delhi
3,448114,5.731369272434022e+18,2016-05-03 13:21:16,80.34361,13.15333,Chennai,TamilNadu
4,665740,3.3888800257079994e+17,2016-05-06 03:51:05,85.99774,23.84261,Bokaro,Jharkhand


In [5]:
#checking the shape of the dataset
df_events_data.shape

(3252950, 7)

<a name = Section412></a>
#### **4.1.2 Importing 2nd & 3rd dataset - 'gender_age_train' & 'phone_brand_device_model' from MySQL database**

- gender_age_train           :- device ids and their respective user gender, age and age group
- phone_brand_device_model   :- device ids, phone brand and device model

In [6]:
# Download gender_age_train and phone_brand_device_model from the provided MySQL instance.
mydb = None                                  # stays None if the connection attempt fails
try:
    mydb = connection.connect(host="cpanel.insaid.co", database='Capstone1', user="student", passwd="student", use_pure=True)
    query1 = "Select * from gender_age_train;"
    query2 = "Select * from phone_brand_device_model;"
    df_gender_age = pd.read_sql(query1, mydb)
    df_brand_model = pd.read_sql(query2, mydb)
except Exception as e:
    print(str(e))
finally:
    if mydb is not None:
        mydb.close()                         # close the connection whether or not the queries succeeded

- getting the head for the gender_age dataset

In [7]:
df_gender_age.head()

Unnamed: 0,device_id,gender,age,group
0,-8076087639492063270,M,35,M32-38
1,-2897161552818060146,M,35,M32-38
2,-8260683887967679142,M,35,M32-38
3,-4938849341048082022,M,30,M29-31
4,245133531816851882,M,30,M29-31


In [8]:
df_gender_age.shape

(74645, 4)

- getting the head for the brand_model dataset

In [9]:
df_brand_model.head()

Unnamed: 0,device_id,phone_brand,device_model
0,1877775838486905855,vivo,Y13
1,-3766087376657242966,小米,V183
2,-6238937574958215831,OPPO,R7s
3,8973197758510677470,三星,A368t
4,-2015528097870762664,小米,红米Note2


In [10]:
df_brand_model.shape

(87726, 3)

In [11]:
#df_gender_age.to_csv('gender_age_train.csv')

In [100]:
df_brand_model.to_csv('phone_brand_device_model.csv')

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of features**.

<a name = Section421></a>
#### **4.2.1 Data Information for Events Dataset**

- In this section we will see the **information about the types of features for events dataset**.

In [13]:
df_events_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3252950 entries, 0 to 3252949
Data columns (total 7 columns):
 #   Column     Dtype  
---  ------     -----  
 0   event_id   int64  
 1   device_id  float64
 2   timestamp  object 
 3   longitude  float64
 4   latitude   float64
 5   city       object 
 6   state      object 
dtypes: float64(3), int64(1), object(3)
memory usage: 173.7+ MB


In [14]:
#Checking if null values are present in events dataset
df_events_data.isnull().sum()

event_id       0
device_id    453
timestamp      0
longitude    423
latitude     423
city           0
state        377
dtype: int64

**Observations for Events Dataset:**
1. There are __3252950 records and 7 features__ in the events dataset.
2. There are __453 missing values__ for device_id.
3. Datatype of __device_id is float__, although it is an identifier.
4. __Timestamp__ is of object type and needs to be converted __to datetime.__
5. longitude and latitude have __423 missing values.__
6. State has __377 missing values.__
7. There are __4 numerical features, 2 categorical features and a timestamp.__
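
Observation 4 can be addressed with `pd.to_datetime`. The following is a minimal sketch on a toy frame, not the conversion actually used later in this notebook; `sample` is a hypothetical name and its two rows are copied from the `head()` output above.

```python
import pandas as pd

# Toy frame (hypothetical name) with two rows copied from the head() output
sample = pd.DataFrame({
    "event_id": [2765368, 2955066],
    "timestamp": ["2016-05-07 22:52:05", "2016-05-01 20:44:16"],
})

# Parse the object column into datetime64; an explicit format avoids guessing
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Datetime accessors become available once the dtype is datetime64
sample["hour"] = sample["timestamp"].dt.hour
print(sample.dtypes)
```

Once the column is a datetime, features such as hour of day can be derived with the `.dt` accessor.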

<a name = Section422></a>
#### **4.2.2 Data Information for Gender_age Dataset**

- In this section we will see the **information about the types of features for gender_age dataset**.

In [15]:
df_gender_age.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74645 entries, 0 to 74644
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   device_id  74645 non-null  int64 
 1   gender     74645 non-null  object
 2   age        74645 non-null  int64 
 3   group      74645 non-null  object
dtypes: int64(2), object(2)
memory usage: 2.3+ MB


In [16]:
df_gender_age['device_id'].nunique()

74645

**Observations for Gender_age Dataset:**
1. There are __74645 records and 4 features__.
2. There are __no missing values__.
3. __Correct Datatype__ of all the features.
4. There are __2 numerical features, 2 categorical features.__

<a name = Section423></a>
#### **4.2.3 Data Information for Brand_Model Dataset**

- In this section we will see the **information about the types of features for brand_model dataset**.

In [17]:
df_brand_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87726 entries, 0 to 87725
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   device_id     87726 non-null  int64 
 1   phone_brand   87726 non-null  object
 2   device_model  87726 non-null  object
dtypes: int64(1), object(2)
memory usage: 2.0+ MB


In [18]:
df_brand_model['device_id'].nunique()

87726

**Observations for Brand_model Dataset:**
1. There are __87726 records and 3 features__.
2. There are __no missing values__.
3. __Correct Datatype__ of all the features.
4. There is __1 numerical feature, 2 categorical features.__

<a name = Section5></a>

---
# **5. Data Pre-Processing**
---

<a name = Section51></a>
### **5.1 Filtering the events dataset by states (WestBengal, Karnataka, Bihar, Punjab, Gujarat, Kerala)**


- For this consulting assignment, the team is to focus on 6 states: WestBengal, Karnataka, Bihar, Punjab, Gujarat and Kerala.
- We observed that the events dataset has 377 missing values in the 'state' column.
- So, we first have to handle these missing values before filtering the dataset by state.

In [19]:
df_events_data[df_events_data['state'].isnull()].sample(5)

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state
30889,2164200,-3.9458265403106406e+17,2016-05-04 19:33:09,83.36656,17.75719,Visakhapatnam,
2497131,890098,4.113023436861672e+18,2016-05-02 00:52:39,75.90653,22.73251,Indore,
2092410,2798389,4.620270824872937e+18,2016-05-04 08:13:34,87.81271,22.95798,Arambagh,
520802,1880721,4.428420611296416e+17,2016-05-02 12:25:51,73.92625,18.61613,Pune,
3024207,1814157,4.113023436861672e+18,2016-05-06 22:20:25,75.90653,22.73251,Indore,


- **Handling the missing value in 'state' column for WestBengal, Karnataka, Bihar, Punjab,Gujarat, Kerala**

In [20]:
# finding the unique states
df_events_data['state'].unique()

array(['Delhi', 'WestBengal', 'TamilNadu', 'Jharkhand', 'AndhraPradesh',
       'Maharashtra', 'Gujarat', 'Kerala', 'MadhyaPradesh', 'Karnataka',
       'Rajasthan', 'Orissa', 'Punjab', 'UttarPradesh', 'Nagaland',
       'Haryana', 'Telangana', 'Chhattisgarh', 'Bihar', 'JammuandKashmir',
       'Assam', 'Goa', 'Mizoram', 'Tripura', 'Uttaranchal', nan,
       'Pondicherry', 'Manipur', 'Meghalaya', 'ArunachalPradesh',
       'HimachalPradesh', 'Chandigarh', 'AndamanandNicobarIslands'],
      dtype=object)

In [21]:
#finding out the cities for the missing state so that we can fill the missing states from the corresponding city.
(df_events_data['city'][df_events_data['state'].isnull()]).unique()

array(['Pune', 'Visakhapatnam', 'Indore', 'Chennai', 'Delhi',
       'Channapatna', 'Jaipur', 'Gangarampur', 'Arambagh'], dtype=object)

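 - The other cities (Pune, Visakhapatnam, Indore, Chennai, Delhi, Jaipur) lie outside the six states of interest, so we will fill the missing states only for these 3 cities: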
 - Channapatna -> Karnataka
 - Gangarampur -> WestBengal
 - Arambagh -> WestBengal
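
The city-to-state mapping above can also be applied in a single step with a dictionary. This is a sketch on toy rows, offered as an alternative to the per-city assignments in this notebook; `events` and `city_to_state` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy rows; the city -> state pairs are the three listed above
events = pd.DataFrame({
    "city": ["Channapatna", "Gangarampur", "Arambagh", "Pune", "Calcutta"],
    "state": [np.nan, np.nan, np.nan, np.nan, "WestBengal"],
})

city_to_state = {
    "Channapatna": "Karnataka",
    "Gangarampur": "WestBengal",
    "Arambagh": "WestBengal",
}

# fillna only touches rows where state is missing; cities absent from the
# mapping (e.g. Pune) map to NaN and therefore stay missing
events["state"] = events["state"].fillna(events["city"].map(city_to_state))
print(events)
```

Rows that already have a state are left untouched, so the mapping cannot overwrite existing values.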

- **Filling the missing values for the states**

In [22]:
#Replacing the nan values in state with Karnataka where city is Channapatna
df_events_data.loc[(df_events_data['city'] == 'Channapatna') & (df_events_data['state'].isnull()), 'state'] = 'Karnataka'

In [23]:
#Replacing the nan values in state with WestBengal where city is Gangarampur
df_events_data.loc[(df_events_data['city'] == 'Gangarampur') & (df_events_data['state'].isnull()), 'state'] = 'WestBengal'

In [24]:
#Replacing the nan values in state with WestBengal where city is Arambagh
df_events_data.loc[(df_events_data['city'] == 'Arambagh') & (df_events_data['state'].isnull()), 'state'] = 'WestBengal'

In [25]:
#Re-checking if events dataset contains any missing states for WestBengal, Karnataka, Bihar, Punjab,Gujarat and Kerala 
#by checking the list of cities for the missing states.
(df_events_data['city'][df_events_data['state'].isnull()]).unique()

array(['Pune', 'Visakhapatnam', 'Indore', 'Chennai', 'Delhi', 'Jaipur'],
      dtype=object)

In [26]:
df_events_data['state'].isnull().sum()

321

- Of the 377 missing states, only 56 belonged to WestBengal, Karnataka, Bihar, Punjab, Gujarat or Kerala. The remaining 321 belong to other states and are excluded by the filter below.

- **Filtering the events dataset by states (WestBengal, Karnataka, Bihar, Punjab, Gujarat, Kerala)**

In [27]:
df_events_data_filtered = df_events_data[df_events_data['state'].isin(['WestBengal', 'Karnataka', 'Bihar', 'Punjab','Gujarat', 'Kerala'])].copy()  # copy so later .loc assignments modify this frame, not a view

In [28]:
df_events_data_filtered.head()

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state
1,2955066,4.734221357723753e+18,2016-05-01 20:44:16,88.38836,22.66033,Calcutta,WestBengal
28,769546,-1.8175023194785695e+18,2016-05-01 14:07:23,88.37181,22.66285,Calcutta,WestBengal
30,1750603,-5.598137337131307e+18,2016-05-05 15:47:03,70.21268,23.11837,Gandhidham,Gujarat
31,3085968,-3.808296883972396e+18,2016-05-07 01:25:47,75.51302,11.81237,Thalassery,Kerala
39,1407594,-2.9955077608063503e+18,2016-05-03 20:01:35,77.80519,13.5333,ChikBallapur,Karnataka


In [29]:
df_events_data_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422971 entries, 1 to 3252921
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   event_id   422971 non-null  int64  
 1   device_id  422923 non-null  float64
 2   timestamp  422971 non-null  object 
 3   longitude  422929 non-null  float64
 4   latitude   422929 non-null  float64
 5   city       422971 non-null  object 
 6   state      422971 non-null  object 
dtypes: float64(3), int64(1), object(3)
memory usage: 25.8+ MB


In [30]:
#checking null values in the filtered dataset
df_events_data_filtered.isnull().sum()

event_id      0
device_id    48
timestamp     0
longitude    42
latitude     42
city          0
state         0
dtype: int64

**Observations for the filtered event dataset:**
- There are __422971 records and 7 features__ in the filtered events dataset.
- __event_id, timestamp, city and state__ columns have __no missing values.__
- __device_id__ has __48 missing values__
- __longitude and latitude__ have __42 missing values__ each.
- __Data types__ for all the columns are correct.

<a name = Section52></a>
### **5.2 Merging the filtered events dataset with gender_age and brand_model datasets**

- Here, we will map all the records of **_events_** dataset with **_gender_age_** and **_brand_model_** datasets to get the demographic details and the brand model details of the users.
- We will create a new merged dataframe which will be used for further analysis.
- Merging of the datasets will happen on the common column _device id_.
- So, first we need to fill in the missing values for the _device id_ in the filtered events dataset.

- **Handling missing device_ids in the filtered dataset**

In [31]:
df_events_data_filtered[df_events_data_filtered['device_id'].isnull()].groupby(['longitude', 'latitude','city','state']).count() #sort_values(['state','city'])

longitude,latitude,city,state,event_id,device_id,timestamp
70.68639,21.79069,Jetpur,Gujarat,16,0,16
73.16934,21.19428,Bardoli,Gujarat,16,0,16
75.99255,31.56175,Hoshiarpur,Punjab,16,0,16


- The 48 missing device_ids fall at 3 distinct locations (16 events each), so we treat them as 3 devices.
- They can be mapped as:
  - 1st device id where longitude = 73.16934, latitude = 21.19428, city= Bardoli , state=Gujarat
  - 2nd device id where longitude = 70.68639, latitude = 21.79069, city= Jetpur, state=Gujarat
  - 3rd device id where longitude = 75.99255, latitude = 31.56175, city= Hoshiarpur, state=Punjab

In [32]:
# look up the 1st device id by an exact longitude/latitude match
df_events_data_filtered[(df_events_data_filtered['longitude'] == 73.16934) & (df_events_data_filtered['latitude'] == 21.19428)]

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state
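
The lookup above returns no rows. A likely cause (an assumption, since the raw file is not shown) is that the stored coordinates carry more digits than the five decimals printed under the `display.float_format` option set in 3.3, so an exact `==` against the printed value matches nothing. A minimal sketch of a tolerance-based comparison with `np.isclose`, on toy values whose extra digits are made up:

```python
import numpy as np
import pandas as pd

# Toy frame: stored values have more digits than the 5 shown on screen
# (the extra digits are invented for illustration)
df = pd.DataFrame({
    "longitude": [73.169342, 70.686391],
    "latitude": [21.194278, 21.790693],
    "city": ["Bardoli", "Jetpur"],
})

# Exact equality against the rounded, printed values finds nothing
exact = df[(df["longitude"] == 73.16934) & (df["latitude"] == 21.19428)]

# Compare within half a unit of the last displayed digit instead
tol = 5e-6
close = df[np.isclose(df["longitude"], 73.16934, rtol=0, atol=tol)
           & np.isclose(df["latitude"], 21.19428, rtol=0, atol=tol)]
print(len(exact), len(close))
```

Filtering by city and state, as the next cells do, sidesteps the problem in a different way.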


In [33]:
# get the 1st device id where longitude = 73.16934, latitude = 21.19428, city= Bardoli , state=Gujarat
df_events_data_filtered[(df_events_data_filtered['city'] == 'Bardoli') & (df_events_data_filtered['state'] == 'Gujarat')].groupby(['device_id','longitude', 'latitude','city','state']).count()

device_id,longitude,latitude,city,state,event_id,timestamp
-8.939375088504041e+18,73.20107,21.17746,Bardoli,Gujarat,15,15
-8.245807954961571e+18,73.15158,21.19418,Bardoli,Gujarat,1,1
-8.215770519233684e+18,73.16934,21.19428,Bardoli,Gujarat,381,381
-8.101478099587991e+18,73.16179,21.2045,Bardoli,Gujarat,5,5
-8.022320605308701e+18,73.20153,21.19053,Bardoli,Gujarat,18,18
-7.629548547563756e+18,73.20758,21.15599,Bardoli,Gujarat,8,8
-7.046882659286756e+18,73.21763,21.14622,Bardoli,Gujarat,1,1
-6.62499669212828e+18,73.15566,21.18198,Bardoli,Gujarat,13,13
-6.313122852748912e+18,73.13923,21.20802,Bardoli,Gujarat,84,84
-5.509893450287351e+18,73.13438,21.18972,Bardoli,Gujarat,20,20


- 1st device id is -8215770519233685504 where longitude = 73.16934, latitude = 21.19428, city= Bardoli , state=Gujarat

In [34]:
# get the 2nd device id where longitude = 70.68639, latitude = 21.79069, city= Jetpur, state=Gujarat
df_events_data_filtered[(df_events_data_filtered['city'] == 'Jetpur') & (df_events_data_filtered['state'] == 'Gujarat')].groupby(['device_id','longitude', 'latitude','city','state']).count()

device_id,longitude,latitude,city,state,event_id,timestamp
-9.100626844458296e+18,70.68467,21.79341,Jetpur,Gujarat,32,32
-7.864383677585043e+18,70.62907,21.82055,Jetpur,Gujarat,95,95
-7.356791296642819e+18,70.6862,21.80623,Jetpur,Gujarat,1,1
-6.941872424033244e+18,70.62482,21.80541,Jetpur,Gujarat,4,4
-4.748170714083748e+18,70.62804,21.76449,Jetpur,Gujarat,9,9
-4.6192833835356e+18,70.68497,21.78174,Jetpur,Gujarat,17,17
-4.1610761399926313e+18,70.6339,21.82504,Jetpur,Gujarat,26,26
-4.109861778213653e+18,70.64444,21.81232,Jetpur,Gujarat,2,2
-3.642392870862093e+18,70.67585,21.82392,Jetpur,Gujarat,22,22
-3.5946240244289265e+18,70.66528,21.79814,Jetpur,Gujarat,54,54


 - 2nd device id is -1688015122502424064 where longitude = 70.68639, latitude = 21.79069, city= Jetpur, state=Gujarat

In [35]:
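# get the 3rd device id where longitude = 75.99255, latitude = 31.56175, city= Hoshiarpur, state=Punjab
df_events_data_filtered[(df_events_data_filtered['city'] == 'Hoshiarpur') & (df_events_data_filtered['state'] == 'Punjab')].groupby(['device_id','longitude', 'latitude','city','state']).count()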

device_id,longitude,latitude,city,state,event_id,timestamp
-9.179704438540888e+18,76.00749,31.56383,Hoshiarpur,Punjab,5,5
-9.003299201851031e+18,75.96806,31.57004,Hoshiarpur,Punjab,1,1
-7.959352580287944e+18,75.97433,31.60916,Hoshiarpur,Punjab,12,12
-7.516713034711112e+18,75.92446,31.60512,Hoshiarpur,Punjab,16,16
-6.314072935617406e+18,75.93961,31.55397,Hoshiarpur,Punjab,31,31
-6.016575845341768e+18,75.95764,31.60668,Hoshiarpur,Punjab,11,11
-5.583822054383365e+18,75.9536,31.60232,Hoshiarpur,Punjab,2,2
-4.25260210117708e+18,75.99204,31.62383,Hoshiarpur,Punjab,10,10
-3.9235514711039457e+18,75.9635,31.61652,Hoshiarpur,Punjab,113,113
-1.4147770740079757e+18,75.92175,31.57198,Hoshiarpur,Punjab,4,4


- 3rd device id is 1750778632182066944 where longitude = 75.99255, latitude = 31.56175, city= Hoshiarpur, state=Punjab

- **Filling the missing values for device ids**

In [36]:
#Replacing the missing device id for 1st missing device
df_events_data_filtered.loc[(df_events_data_filtered['city'] == 'Bardoli') & (df_events_data_filtered['state'] == 'Gujarat') & (df_events_data_filtered['device_id'].isnull()), 'device_id'] = -8215770519233685504

In [37]:
#Replacing the missing device id for 2nd missing device
df_events_data_filtered.loc[(df_events_data_filtered['city'] == 'Jetpur') & (df_events_data_filtered['state'] == 'Gujarat') & (df_events_data_filtered['device_id'].isnull()), 'device_id'] = -1688015122502424064

In [38]:
#Replacing the missing device id for 3rd missing device
df_events_data_filtered.loc[(df_events_data_filtered['city'] == 'Hoshiarpur') & (df_events_data_filtered['state'] == 'Punjab') & (df_events_data_filtered['device_id'].isnull()), 'device_id'] = 1750778632182066944
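
The three manual replacements above follow one pattern: a missing device id is taken from other events logged at the same coordinates. A sketch of doing this in one pass on toy data, assuming each coordinate pair belongs to a single device (the assumption is checked before filling); `events` and the string ids are hypothetical.

```python
import pandas as pd

# Toy events: device ids are kept as strings here so that no digits are lost
events = pd.DataFrame({
    "device_id": ["A", None, "A", "B", None, "C"],
    "longitude": [73.16934, 73.16934, 73.16934, 70.68639, 70.68639, 75.99255],
    "latitude":  [21.19428, 21.19428, 21.19428, 21.79069, 21.79069, 31.56175],
})

by_location = events.groupby(["longitude", "latitude"])["device_id"]

# Guard: the lookup is only safe if no location maps to more than one device
if (by_location.nunique() > 1).any():
    raise ValueError("some locations are shared by several devices")

# For each row, the first non-null device id seen at the same coordinates
lookup = by_location.transform("first")

# Only rows with a missing device id are overwritten
events["device_id"] = events["device_id"].fillna(lookup)
print(events["device_id"].tolist())
```

On the real data the guard matters: if two devices share a location, the fill would be ambiguous and the manual inspection used above is the safer route.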

In [39]:
df_events_data_filtered.isnull().sum()

event_id      0
device_id     0
timestamp     0
longitude    42
latitude     42
city          0
state         0
dtype: int64

In [40]:
df_events_data_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422971 entries, 1 to 3252921
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   event_id   422971 non-null  int64  
 1   device_id  422971 non-null  float64
 2   timestamp  422971 non-null  object 
 3   longitude  422929 non-null  float64
 4   latitude   422929 non-null  float64
 5   city       422971 non-null  object 
 6   state      422971 non-null  object 
dtypes: float64(3), int64(1), object(3)
memory usage: 41.9+ MB


- **1st merge: the filtered events dataset with the gender_age dataset on** _device_id_
   - First, we need to convert the data type of __device_id__ in the gender_age dataset to __float__ so that it matches the events dataset.

In [41]:
df_gender_age['device_id'] = df_gender_age['device_id'].astype(df_events_data_filtered['device_id'].dtype)
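
One caveat with casting ids to float: float64 has a 53-bit significand, so 19-digit integer ids cannot all be represented exactly, and two distinct ids can collapse to the same float. A small sketch of the effect; `a` is the first `device_id` shown in `df_brand_model.head()` and `b` is a made-up neighbour.

```python
import numpy as np

a = np.int64(1877775838486905855)   # first device_id in df_brand_model.head()
b = np.int64(1877775838486905856)   # made-up neighbouring id for illustration

print(a == b)                          # distinct as 64-bit integers
print(np.float64(a) == np.float64(b))  # identical after the cast to float64

# Above 2**53 not every integer has its own float64 value
print(float(2**53) == float(2**53 + 1))
```

Whether this causes wrong matches in the merge depends on how close the real ids are to each other, which is not checked here. Keeping the ids as integers or strings end to end would avoid the question, provided the source files store them that way.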

In [42]:
df_gender_age.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74645 entries, 0 to 74644
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   device_id  74645 non-null  float64
 1   gender     74645 non-null  object 
 2   age        74645 non-null  int64  
 3   group      74645 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 2.3+ MB


- **Merging the two datasets**

In [43]:
df_events_gender = pd.merge(df_events_data_filtered, df_gender_age, on='device_id', how='left')
df_events_gender.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422971 entries, 0 to 422970
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   event_id   422971 non-null  int64  
 1   device_id  422971 non-null  float64
 2   timestamp  422971 non-null  object 
 3   longitude  422929 non-null  float64
 4   latitude   422929 non-null  float64
 5   city       422971 non-null  object 
 6   state      422971 non-null  object 
 7   gender     422971 non-null  object 
 8   age        422971 non-null  int64  
 9   group      422971 non-null  object 
dtypes: float64(3), int64(2), object(5)
memory usage: 35.5+ MB


In [44]:
df_events_gender.isnull().sum()

event_id      0
device_id     0
timestamp     0
longitude    42
latitude     42
city          0
state         0
gender        0
age           0
group         0
dtype: int64
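
A left merge can be sanity-checked with the `validate` and `indicator` parameters of `pd.merge`. The sketch below uses toy frames (`events` and `demographics` are hypothetical names) to show how unmatched events and duplicate keys would surface.

```python
import pandas as pd

# Toy frames: device 30 has no demographic record
events = pd.DataFrame({"event_id": [1, 2, 3], "device_id": [10, 10, 30]})
demographics = pd.DataFrame({"device_id": [10, 20], "gender": ["M", "F"]})

merged = pd.merge(
    events, demographics,
    on="device_id", how="left",
    validate="many_to_one",   # raises MergeError if device_id repeats on the right
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# Events whose device_id found no match on the right-hand side
unmatched = (merged["_merge"] == "left_only").sum()
print(len(merged), unmatched)
```

In this notebook the merged frame has no missing values in `gender`, `age` or `group`, which suggests every filtered event found a match.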

In [45]:
df_events_gender.head()

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state,gender,age,group
0,2955066,4.734221357723753e+18,2016-05-01 20:44:16,88.38836,22.66033,Calcutta,WestBengal,M,30,M29-31
1,769546,-1.8175023194785695e+18,2016-05-01 14:07:23,88.37181,22.66285,Calcutta,WestBengal,F,43,F43+
2,1750603,-5.598137337131307e+18,2016-05-05 15:47:03,70.21268,23.11837,Gandhidham,Gujarat,M,23,M23-26
3,3085968,-3.808296883972396e+18,2016-05-07 01:25:47,75.51302,11.81237,Thalassery,Kerala,M,24,M23-26
4,1407594,-2.9955077608063503e+18,2016-05-03 20:01:35,77.80519,13.5333,ChikBallapur,Karnataka,M,29,M29-31


In [46]:
df_events_gender['device_id'].nunique()

19032

- **2nd merge: the events_gender dataset with the brand_model dataset on** _device_id_.
   - First, we need to convert the data type of __device_id__ in the brand_model dataset to __float__ so that it matches the events_gender dataset.

In [47]:
df_brand_model['device_id'] = df_brand_model['device_id'].astype(df_events_gender['device_id'].dtype)

In [48]:
df_brand_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87726 entries, 0 to 87725
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   device_id     87726 non-null  float64
 1   phone_brand   87726 non-null  object 
 2   device_model  87726 non-null  object 
dtypes: float64(1), object(2)
memory usage: 2.0+ MB


- **Merging the two datasets to get final merged dataset**

In [49]:
df_final_merged = pd.merge(df_events_gender, df_brand_model, on='device_id', how='left')
df_final_merged.head()

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state,gender,age,group,phone_brand,device_model
0,2955066,4.734221357723753e+18,2016-05-01 20:44:16,88.38836,22.66033,Calcutta,WestBengal,M,30,M29-31,vivo,X5M
1,769546,-1.8175023194785695e+18,2016-05-01 14:07:23,88.37181,22.66285,Calcutta,WestBengal,F,43,F43+,OPPO,R819T
2,1750603,-5.598137337131307e+18,2016-05-05 15:47:03,70.21268,23.11837,Gandhidham,Gujarat,M,23,M23-26,魅族,MX3
3,3085968,-3.808296883972396e+18,2016-05-07 01:25:47,75.51302,11.81237,Thalassery,Kerala,M,24,M23-26,vivo,X5L
4,1407594,-2.9955077608063503e+18,2016-05-03 20:01:35,77.80519,13.5333,ChikBallapur,Karnataka,M,29,M29-31,OPPO,R7 Plus


In [50]:
df_final_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422971 entries, 0 to 422970
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   event_id      422971 non-null  int64  
 1   device_id     422971 non-null  float64
 2   timestamp     422971 non-null  object 
 3   longitude     422929 non-null  float64
 4   latitude      422929 non-null  float64
 5   city          422971 non-null  object 
 6   state         422971 non-null  object 
 7   gender        422971 non-null  object 
 8   age           422971 non-null  int64  
 9   group         422971 non-null  object 
 10  phone_brand   422971 non-null  object 
 11  device_model  422971 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 42.0+ MB


In [51]:
df_final_merged.isnull().sum()

event_id         0
device_id        0
timestamp        0
longitude       42
latitude        42
city             0
state            0
gender           0
age              0
group            0
phone_brand      0
device_model     0
dtype: int64

**Observations of the final merged dataset:**
- There are __422971 records and 12 features.__
- __longitude and latitude__ columns have 42 missing values each. There are __84 missing cells.__
- __Timestamp is of object__ type and needs to be converted __to datetime.__
- Data types for all other columns are correct.
- There are __6 categorical columns, 5 numerical columns and a timestamp.__
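
The timestamp conversion noted above can be sketched as follows. This is a minimal example on a three-row sample copied from the head of the merged data; the `sample` frame and the derived `hour` column are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Three timestamps copied from the first rows of the merged dataset
sample = pd.DataFrame(
    {"timestamp": ["2016-05-01 20:44:16", "2016-05-01 14:07:23", "2016-05-05 15:47:03"]}
)

# Parse the strings into datetime64 values; an explicit format avoids guessing
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Once parsed, the .dt accessor exposes calendar parts (hypothetical derived column)
sample["hour"] = sample["timestamp"].dt.hour

print(sample.dtypes)
```

Applied to the full data, the same call would be `pd.to_datetime(df_final_merged['timestamp'])`.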

In [52]:
df_final_merged.describe()

Unnamed: 0,event_id,device_id,longitude,latitude,age
count,422971.0,422971.0,422929.0,422929.0,422971.0
mean,1635553.21051,5.433259387169599e+16,82.3353,20.79314,31.44556
std,930915.34686,5.33046150904228e+18,6.38155,5.4245,9.78804
min,20881.0,-9.221066489596333e+18,12.5674,8.41244,11.0
25%,822921.0,-4.582571963008607e+18,76.5506,15.20491,25.0
50%,1632956.0,1.2446067271300688e+17,85.51903,22.62057,29.0
75%,2443386.0,4.7032860510451e+18,88.40805,23.10415,36.0
max,3252946.0,9.222849349208141e+18,89.62207,41.8719,88.0


In [53]:
df_final_merged.describe(include=['object'])

Unnamed: 0,timestamp,city,state,gender,group,phone_brand,device_model
count,422971,422971,422971,422971,422971,422971,422971
unique,290235,311,6,2,12,91,1092
top,2016-05-05 10:39:01,Calcutta,WestBengal,M,M32-38,小米,红米note
freq,15,122381,196203,272093,54822,108188,17759


**Observations:**
- __event id is unique as expected.__
- __device id has 19032 distinct values__ and no missing or zero values.
- __Age: Minimum age is 11 and maximum age is 88.__ 
- __Age distribution is right skewed__ since the mean (31.4) is greater than the median (29).
- __50% of the records have an age between 25 and 36 years, and 75% have an age of 36 years or below.__ 
- From the 75th percentile (36 yrs) to the max age (88) the distribution is widely spread out. There seem to be outliers here.
- __Timestamp has high cardinality, as expected.__
- __City__ has 311 unique values and __Calcutta__ tops the list in network usage.
- __State__ has 6 unique values and __West Bengal__ tops the list.
- __There are 2 gender types. Male records outnumber Female records, in a ratio of approximately 64:36.__
- There are __12 age groups__. The most frequent group is M32-38 (males aged 32-38 years).
- There are __91 phone brands__ used in these 6 states and __小米 (Xiaomi)__ is the most used phone brand.
- There are __1092 device models__ used in these 6 states and __红米note (Redmi Note)__ is the most used phone model.
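
The suspicion of outliers in age can be checked with simple arithmetic on the quartiles from `describe()`. The sketch below uses the common 1.5 × IQR rule; that rule is an assumption chosen for illustration, since the notebook does not say which outlier criterion it has in mind.

```python
# Quartiles and extremes of `age`, copied from the describe() output above
q1, q3 = 25.0, 36.0
age_min, age_max = 11.0, 88.0

# Interquartile range and the conventional 1.5 * IQR fences (assumed rule)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value beyond a fence is flagged as a potential outlier
has_low_outliers = age_min < lower_fence
has_high_outliers = age_max > upper_fence

print(f"IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(f"low outliers: {has_low_outliers}, high outliers: {has_high_outliers}")
```

Under this rule only the upper tail is flagged: ages above 52.5 are candidate outliers, while the minimum of 11 stays inside the lower fence.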

<a name = Section53></a>
### **5.3 Pre-Profiling Report**

- For **quick analysis** pandas profiling is very handy.

- Generates profile reports from a pandas DataFrame.

- For each column **statistics** are presented in an interactive HTML report.

In [54]:
profile = ProfileReport(df = df_final_merged)
profile.to_file(output_file = 'CDF Capstone Pre Profiling Report.html')
print('Accomplished!')

Accomplished!


__Observations from Pandas Profiling before Data Processing__<br><br>
__Dataset info__:
- Number of variables: 12
- Number of observations: 422971
- Missing cells: 84
- Duplicate rows: 0

__Variables types__: 
- Numeric = 5
- Categorical = 7


<br>  

- event id is unique as expected.
- device id has 19032 distinct values (4.5%) and no missing or zero values.
- Timestamp has high cardinality and is uniform. It has 68.6% distinct values.
- longitude and latitude have 42 missing values each, which is < 0.1%.
- City has 311 distinct values. Calcutta has the highest frequency (28.9%), followed by Bangalore (11.8%). All other cities are < 0.5%.
- State has 6 distinct values. WestBengal has the highest frequency (46.4%), followed by Karnataka (23.4%). All other states are < 10%.
- Gender has 2 distinct values. Male is 64.3% and Female is 35.7%. It is highly correlated with the group column.
- Age distribution is right skewed. Mean age is 31.4.
- For full details check out the report.
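
The percentages quoted from the profiling report can be reproduced from the counts shown earlier in the `describe()` outputs. The sketch below is only a consistency check; the dictionary labels are illustrative names, not columns in the data.

```python
# Total number of records in the merged dataset
total_records = 422971

# Counts copied from the describe() outputs and the observations above
counts = {
    "city: Calcutta": 122381,
    "state: WestBengal": 196203,
    "gender: M": 272093,
    "timestamp (distinct)": 290235,
    "device_id (distinct)": 19032,
}

# Share of each count in the total, as a percentage rounded to one decimal
shares = {label: round(100 * count / total_records, 1) for label, count in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share}%")
```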



<a name = Section54></a>
### **5.4 Handling of Missing Data**

- In this section we will **handle** **missing information** such as **null data** and **zero data**.

- **Handling the missing values for longitude and latitude**

In [66]:
df_final_merged[df_final_merged['longitude'].isnull()].groupby('device_id').count()

Unnamed: 0_level_0,event_id,timestamp,longitude,latitude,city,state
device_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1.3200509770197112e+18,14,14,0,0,14,14
3.0991685461987686e+18,14,14,0,0,14,14
6.774071338248977e+18,14,14,0,0,14,14


In [85]:
df_final_merged[df_final_merged['device_id'] == 1320050977019711232].sample()

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state,gender,age,group,phone_brand,device_model
328782,330457,1.3200509770197112e+18,2016-05-02 08:29:17,87.57074,26.21192,Araria,Bihar,F,36,F33-42,vivo,X1ST


In [86]:

# Count records per distinct (city, state, longitude, latitude) for this device.
# groupby drops rows whose keys are NaN, so only the known coordinates appear.
df_final_merged[(df_final_merged['device_id'] == 1320050977019711232)].groupby(['device_id','city','state','longitude','latitude']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,event_id,timestamp,gender,age,group,phone_brand,device_model
device_id,city,state,longitude,latitude,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1.3200509770197112e+18,Araria,Bihar,87.57074,26.21192,482,482,482,482,482,482,482


**Device id 1320050977019711232 has a single longitude/latitude pair across its 482 located records; its remaining 14 records are null. So we can fill the missing values for this device with longitude 87.57074 and latitude 26.21192.**

In [87]:
# Fill this device's missing coordinates with its only known longitude and latitude
df_final_merged.loc[((df_final_merged['city'] == 'Araria') & (df_final_merged['state'] == 'Bihar') & (df_final_merged['device_id'] == 1320050977019711232)) & (df_final_merged['longitude'].isnull()), 'longitude'] = 87.57074
df_final_merged.loc[((df_final_merged['city'] == 'Araria') & (df_final_merged['state'] == 'Bihar') & (df_final_merged['device_id'] == 1320050977019711232)) & (df_final_merged['latitude'].isnull()), 'latitude'] = 26.21192

In [88]:
df_final_merged[df_final_merged['longitude'].isnull()].groupby('device_id').count()

Unnamed: 0_level_0,event_id,timestamp,longitude,latitude,city,state,gender,age,group,phone_brand,device_model
device_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
3.0991685461987686e+18,14,14,0,0,14,14,14,14,14,14,14
6.774071338248977e+18,14,14,0,0,14,14,14,14,14,14,14


In [89]:
df_final_merged[df_final_merged['device_id'] == 3099168546198768640].sample()

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state,gender,age,group,phone_brand,device_model
422673,1027767,3.0991685461987686e+18,2016-05-05 01:39:44,84.1409,27.1774,Bagaha,Bihar,M,37,M32-38,语信,小辣椒 M2


In [90]:
df_final_merged[(df_final_merged['device_id'] == 3099168546198768640)].groupby(['device_id','city','state','longitude','latitude']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,event_id,timestamp,gender,age,group,phone_brand,device_model
device_id,city,state,longitude,latitude,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
3.0991685461987686e+18,Bagaha,Bihar,84.1409,27.1774,533,533,533,533,533,533,533


**Device id 3099168546198768640 has a single longitude/latitude pair across its 533 located records; its remaining 14 records are null. So we can fill the missing values for this device with longitude 84.14090 and latitude 27.17740.**

In [91]:
# Fill this device's missing coordinates with its only known longitude and latitude
df_final_merged.loc[((df_final_merged['city'] == 'Bagaha') & (df_final_merged['state'] == 'Bihar') & (df_final_merged['device_id'] == 3099168546198768640)) & (df_final_merged['longitude'].isnull()), 'longitude'] = 84.14090
df_final_merged.loc[((df_final_merged['city'] == 'Bagaha') & (df_final_merged['state'] == 'Bihar') & (df_final_merged['device_id'] == 3099168546198768640)) & (df_final_merged['latitude'].isnull()), 'latitude'] = 27.17740

In [92]:
df_final_merged[df_final_merged['longitude'].isnull()].groupby('device_id').count()

Unnamed: 0_level_0,event_id,timestamp,longitude,latitude,city,state,gender,age,group,phone_brand,device_model
device_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
6.774071338248977e+18,14,14,0,0,14,14,14,14,14,14,14


In [93]:
df_final_merged[df_final_merged['device_id'] == 6774071338248978432].sample()

Unnamed: 0,event_id,device_id,timestamp,longitude,latitude,city,state,gender,age,group,phone_brand,device_model
156005,730212,6.774071338248977e+18,2016-05-03 09:30:14,75.26875,30.90418,Moga,Punjab,M,21,M22-,魅族,MX4


In [94]:
df_final_merged[(df_final_merged['device_id'] == 6774071338248978432)].groupby(['device_id','city','state','longitude','latitude']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,event_id,timestamp,gender,age,group,phone_brand,device_model
device_id,city,state,longitude,latitude,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
6.774071338248977e+18,Moga,Punjab,75.26875,30.90418,444,444,444,444,444,444,444


**Device id 6774071338248978432 has a single longitude/latitude pair across its 444 located records; its remaining 14 records are null. So we can fill the missing values for this device with longitude 75.26875 and latitude 30.90418.**

In [95]:
#filling missing values
df_final_merged.loc[((df_final_merged['city'] == 'Moga') & (df_final_merged['state'] == 'Punjab') & (df_final_merged['device_id'] == 6774071338248978432)) & (df_final_merged['longitude'].isnull()), 'longitude'] = 75.26875
df_final_merged.loc[((df_final_merged['city'] == 'Moga') & (df_final_merged['state'] == 'Punjab') & (df_final_merged['device_id'] == 6774071338248978432)) & (df_final_merged['latitude'].isnull()), 'latitude'] = 30.90418	

In [96]:
df_final_merged[df_final_merged['longitude'].isnull()].groupby('device_id').count()

Unnamed: 0_level_0,event_id,timestamp,longitude,latitude,city,state,gender,age,group,phone_brand,device_model
device_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1


In [97]:
df_final_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422971 entries, 0 to 422970
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   event_id      422971 non-null  int64  
 1   device_id     422971 non-null  float64
 2   timestamp     422971 non-null  object 
 3   longitude     422971 non-null  float64
 4   latitude      422971 non-null  float64
 5   city          422971 non-null  object 
 6   state         422971 non-null  object 
 7   gender        422971 non-null  object 
 8   age           422971 non-null  int64  
 9   group         422971 non-null  object 
 10  phone_brand   422971 non-null  object 
 11  device_model  422971 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 42.0+ MB
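
The three manual fills above follow the same pattern, so they could also be expressed once for all devices. The sketch below is one possible generalisation, not the notebook's method: `fill_coords_from_device` is a hypothetical helper, and it only fills a gap when the device has exactly one distinct known value, which matches the condition checked by hand above.

```python
import numpy as np
import pandas as pd


def fill_coords_from_device(df, coord_cols=("longitude", "latitude")):
    """Fill missing coordinates from the same device's single known value."""
    out = df.copy()
    for col in coord_cols:
        grouped = out.groupby("device_id")[col]
        # Number of distinct non-null values per device, broadcast to every row
        n_known = grouped.transform("nunique")
        # First non-null value per device, broadcast to every row
        first_known = grouped.transform("first")
        # Fill only where the value is missing and the device is unambiguous
        mask = out[col].isna() & (n_known == 1)
        out.loc[mask, col] = first_known[mask]
    return out


# Tiny illustrative frame: device 1 has one known location, device 2 has none
demo = pd.DataFrame(
    {
        "device_id": [1, 1, 1, 2, 2],
        "longitude": [np.nan, 87.57074, 87.57074, np.nan, np.nan],
        "latitude": [np.nan, 26.21192, 26.21192, np.nan, np.nan],
    }
)
filled = fill_coords_from_device(demo)
print(filled)
```

Devices with no known location, or with several different ones, are left untouched so that they can be inspected separately.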


**Phone brand name conversion**

In [99]:
# finding the unique phone brands
df_final_merged['phone_brand'].unique()

array(['vivo', 'OPPO', '魅族', '三星', '努比亚', '小米', '华为', '酷派', '华硕', '锤子',
       '乐视', 'HTC', '海信', 'TCL', '天语', '中国移动', 'ZUK', 'LG', '联想 ', '优米',
       '一加', '语信', '美图', '朵唯', '斐讯', '奇酷', '唯米', '酷比魔方', '富可视', '摩托罗拉',
       '神舟', '昂达', '青橙', '凯利通', '乡米', 'LOGO', '梦米', '青葱', '聆韵', '维图',
       '亿通', '波导', '海尔', '至尊宝', '优购', '艾优尼', '康佳', 'Lovme', '易派', '百立丰',
       '诺基亚', '欧博信', '纽曼', '酷珀', '先锋', '邦华', '宝捷讯', '酷比', '小杨树', '糯米',
       '鲜米', '沃普丰', '台电', '黑米', '优语', '米歌', '夏新', '广信', '欧新', '惠普', '虾米',
       '贝尔丰', '谷歌', '白米', '大可乐', '爱派尔', '蓝魔', '果米', '大Q', '长虹', '欧奇',
       '西米', '尼比鲁', '糖葫芦', 'E派', '飞利浦', '诺亚信', 'PPTV', '德赛', '普耐尔', '欧比'],
      dtype=object)
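
A brand name conversion could be sketched with a lookup table, as below. Only 小米 → Xiaomi is confirmed elsewhere in this notebook; the other entries are commonly used English names supplied here as assumptions, and the table is deliberately partial. `BRAND_TRANSLATIONS` and `translate_brands` are hypothetical names. Note that the array above contains `'联想 '` with a trailing space, so values are stripped before the lookup.

```python
import pandas as pd

# Partial, assumed mapping from Chinese brand names to English names
BRAND_TRANSLATIONS = {
    "小米": "Xiaomi",
    "魅族": "Meizu",
    "三星": "Samsung",
    "华为": "Huawei",
    "酷派": "Coolpad",
    "联想": "Lenovo",
    "华硕": "Asus",
    "努比亚": "Nubia",
    "一加": "OnePlus",
    "摩托罗拉": "Motorola",
    "诺基亚": "Nokia",
    "谷歌": "Google",
    "惠普": "HP",
    "飞利浦": "Philips",
}


def translate_brands(brands: pd.Series) -> pd.Series:
    """Translate known brand names and leave unmapped ones unchanged."""
    cleaned = brands.str.strip()  # removes stray whitespace such as in '联想 '
    return cleaned.map(BRAND_TRANSLATIONS).fillna(cleaned)


demo_brands = pd.Series(["小米", "联想 ", "vivo", "语信"])
translated = translate_brands(demo_brands)
print(translated.tolist())
```

Brands that are already in Latin script, or that are missing from the table, pass through unchanged, so the mapping can be extended gradually.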

<a name = Section55></a>
### **5.5 Feature Engineering**

<a name = Section56></a>
### **5.6 Post Processing Report**

- After **missing value imputation**, **feature engineering**, and **removing unwanted features**, we will look at the report again.

**Observation:**

<a name = Section6></a>

---
# **6. Exploratory Data Analysis**
---

**<h4>Question: </h4>**

<a name = Section7></a>

---
# **7. Post Data Processing & Feature Selection**
---

<a name = Section71></a>
### **7.1 Feature Selection**


<a name = Section72></a>
### **7.2 Encoding Categorical Features**

<a name = Section73></a>
### **7.3 Data Preparation**

- Now we will **split** our **data** into **training** and **testing** sets for further development.
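
Because each device contributes many event records, one option is to split by device so that the same device never appears in both sets. The sketch below illustrates that idea under stated assumptions: `split_by_device`, the 20% test fraction and the seed are all hypothetical choices, and the notebook may equally use a plain row-level split.

```python
import numpy as np
import pandas as pd


def split_by_device(df, test_fraction=0.2, seed=42):
    """Split rows into train and test sets so that no device is in both."""
    rng = np.random.default_rng(seed)
    # Shuffle the distinct device ids reproducibly
    devices = rng.permutation(df["device_id"].unique())
    n_test = max(1, int(round(len(devices) * test_fraction)))
    test_devices = devices[:n_test]
    is_test = df["device_id"].isin(test_devices)
    return df[~is_test], df[is_test]


# Illustrative frame: 10 devices with 3 events each
demo_events = pd.DataFrame(
    {"device_id": np.repeat(np.arange(10), 3), "event_id": np.arange(30)}
)
train_df, test_df = split_by_device(demo_events)
print(len(train_df), len(test_df))
```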

<a name = Section8></a>

---
# **8. Model Development & Evaluation**
---

- In this section we will **develop xxModel namexxx using input features** and **tune** our **model if required**.

- Then we will **analyze the results** obtained and **make our observations**.

- For **evaluation**, we will **focus** on **Accuracy**. We will also check **Precision**, **Recall**, **F1-Score**, the **ROC-AUC curve** and the **Precision-Recall curve**.
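
As a reminder of how the threshold-based metrics relate to each other, the sketch below computes them from the four cells of a binary confusion matrix. The counts are made up for illustration and `classification_metrics` is a hypothetical helper; no model results from this project are implied.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only
metrics = classification_metrics(tp=60, fp=20, fn=15, tn=105)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can look good on imbalanced classes, which is why precision, recall and F1 are worth checking alongside it.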

<a name = Section81></a>

## **8.1 Model Name - Baseline Model**

<a name = Section82></a>

## **8.2 Using Trained Model for Prediction**

<a name = Section83></a>

## **8.3 Model Name  Model Evaluation**

<a name = Section9></a>

---
# **9. Conclusion**
---

<a name = Section91></a>
### **9.1 Conclusion**

<a name = Section92></a>
### **9.2 Actionable Insights**