<p align="center"><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="260" height="110" /></p>

---
# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
  3.1 [**Installing Libraries**](#Section31)<br>
  3.2 [**Upgrading Libraries**](#Section32)<br>
  3.3 [**Importing Libraries**](#Section33)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Summarization**](#Section9)<br>
  9.1 [**Conclusion**](#Section91)<br>
  9.2 [**Actionable Insights**](#Section92)<br>

---

---
<a name = Section1></a>
# **1. Introduction: Insaid Telecom**
---

InsaidTelecom, one of the leading telecom players, understands that customizing its offerings is essential for staying competitive.

Currently, InsaidTelecom is seeking to leverage behavioural data from more than 60% of the 50 million mobile devices active daily in India.

The goal is to help its clients better understand and interact with their audiences.

**Current Scenario**

In this consulting assignment, Insaidians are expected to build a dashboard.
This dashboard will help us understand a user's demographic characteristics based on their mobile usage, geolocation, and mobile device properties.
Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts that are relevant to their users and cater to their preferences.


- Search the internet and research how this problem plays out in real life.

- Note down some concrete points that reflect your own point of view.

---
<a name = Section2></a>
# **2. Problem Statement**
---

- This section provides a general introduction to a problem that most companies face.

- **Example Problem Statement:**

  - In the past few years, prices of new cars have skyrocketed, so most people cannot afford to buy one.

  - Customers buying a new car always look for assurance that it is worth their money.

  - But due to the increased price of new cars, used car sales are on a global increase (Pal, Arora and Palakurthy, 2018).

  - There is a need for a used car price prediction system to effectively determine the worthiness of the car using a variety of features.

  - Even though there are websites that offer this service, their prediction methods may not be the best.

  - Besides, different models and systems may contribute to the predictive power for a used car’s actual market value.

  - It is important to know a car's actual market value when both buying and selling.
  
<p align="center"><img src="https://visme.co/blog/wp-content/uploads/2020/06/animated-interactive-infographics-header-wide.gif"></p>

- Derive a scenario related to the problem statement and head on to the journey of exploration.

- **Example Scenario:**
  - Cars Absolute, an American company, buys and sells second-hand cars.

  - The company has earned its name because of sincerity in work and quality of services.

  - But for the past few months their sales have been down for some reason, and they are unable to figure out why.

  - To tackle this problem, they hired a genius team of data scientists. Consider you are one of them...

---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

- This section covers installing and importing the libraries that will be required.

<a name = Section31></a>
### **3.1 Installing Libraries**

In [2]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q pandas-profiling                                    # Library to generate basic statistics about data
!pip install -q --upgrade yellowbrick

<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the library versions are in sync.

- Make sure not to execute the cells under Installing Libraries and Upgrading Libraries again after restarting the runtime.

In [4]:
!pip install -q --upgrade pandas-profiling                          # Upgrading pandas profiling to the latest version

<a name = Section33></a>
### **3.3 Importing Libraries**

- You can get a head start with the basic libraries imported in the cell below.

- If you want to import some additional libraries, feel free to do so.


In [3]:
import numpy as np                                                  # Importing for numerical computations
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Silencing the warning raised on chained assignments
pd.set_option('display.float_format', lambda x: '%.2f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
from collections import Counter                                     # For counting hashable objects
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import plotly.graph_objs as go                                      # For Plotly interfaced graphs
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppressing all warnings

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- This section covers acquiring the data and obtaining some descriptive information from it.

- You could either scrape the data and then continue, or use a direct link to the source (generally preferred in most cases).

- You will be working with a direct source link to get a head start without worrying about anything.

- Before going further you must have a good idea about the features of the data set:

|Id|Feature|Description|
|:--|:--|:--|
|01| car           | Car brand name|
|02| model         | Available variants of the car|
|03| year          | Year of purchase|
|04| body          | Body type: hatchback, sedan, crossover, etc.|
|05| mileage       | Car mileage|
|06| engV          | Engine volume|
|07| engType       | Fuel type: petrol, diesel, gas, etc.|
|08| drive         | Wheel drive: front, rear, or full|
|09| registration  | Whether the vehicle is registered|
|10| price         | Price of the car in $|


In [4]:
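# Event logs, device demographics, and handset details (local paths; adjust to your environment)
data = pd.read_csv('/Users/bsnijjar/Downloads/events_data.csv')
data1 = pd.read_csv('/Users/bsnijjar/Downloads/gender_age_train.csv')
data2 = pd.read_csv('/Users/bsnijjar/Downloads/phone_brand_device_model.csv')
data.info()
data.head()                                                         # Not rendered: a cell displays only its last expression
data.describe()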

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3252950 entries, 0 to 3252949
Data columns (total 7 columns):
 #   Column     Dtype  
---  ------     -----  
 0   event_id   int64  
 1   device_id  float64
 2   timestamp  object 
 3   longitude  float64
 4   latitude   float64
 5   city       object 
 6   state      object 
dtypes: float64(3), int64(1), object(3)
memory usage: 173.7+ MB


| |event_id|device_id|longitude|latitude|
|:--|:--|:--|:--|:--|
|count|3252950.0|3252497.0|3252527.0|3252527.0|
|mean|1626475.5|1.0122000958550902e+17|78.16|21.69|
|std|939045.92|5.316758188197051e+18|4.24|5.79|
|min|1.0|-9.222956879900151e+18|12.57|8.19|
|25%|813238.25|-4.540611333857475e+18|75.84|17.8|
|50%|1626475.5|1.726820111592788e+17|77.27|22.16|
|75%|2439712.75|4.861813234983624e+18|80.32|28.68|
|max|3252950.0|9.222849349208141e+18|95.46|41.87|


### **Data Description**

- To get a quick description of the data, you can use the `describe` method defined in the pandas library.

In [21]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74645 entries, 0 to 74644
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   device_id  74645 non-null  int64 
 1   gender     74645 non-null  object
 2   age        74645 non-null  int64 
 3   group      74645 non-null  object
dtypes: int64(2), object(2)
memory usage: 2.3+ MB


In [5]:
data1.describe()

| |device_id|age|
|:--|:--|:--|
|count|74645.0|74645.0|
|mean|-749135388419837.0|31.41|
|std|5.327149733911457e+18|9.87|
|min|-9.223067244542179e+18|1.0|
|25%|-4.617366812584265e+18|25.0|
|50%|-1.8413620249632024e+16|29.0|
|75%|4.63665589909315e+18|36.0|
|max|9.222849349208141e+18|96.0|


In [24]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87726 entries, 0 to 87725
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   device_id     87726 non-null  int64 
 1   phone_brand   87726 non-null  object
 2   device_model  87726 non-null  object
dtypes: int64(1), object(2)
memory usage: 2.0+ MB


In [25]:
data2.head()

| |device_id|phone_brand|device_model|
|:--|:--|:--|:--|
|0|1877775838486905855|vivo|Y13|
|1|-3766087376657242966|小米|V183|
|2|-6238937574958215831|OPPO|R7s|
|3|8973197758510677470|三星|A368t|
|4|-2015528097870762664|小米|红米Note2|


In [10]:
combinedData = pd.merge(data, data1, on='device_id',how='left')
combinedData.head()

| |event_id|device_id|timestamp|longitude|latitude|city|state|gender|age|group|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|2765368|2.9733477869949143e+18|2016-05-07 22:52:05|77.23|28.73|Delhi|Delhi|M|35.0|M32-38|
|1|2955066|4.734221357723753e+18|2016-05-01 20:44:16|88.39|22.66|Calcutta|WestBengal| | | |
|2|605968|-3.264499652692493e+18|2016-05-02 14:23:04|77.26|28.76|Delhi|Delhi| | | |
|3|448114|5.731369272434022e+18|2016-05-03 13:21:16|80.34|13.15|Chennai|TamilNadu| | | |
|4|665740|3.3888800257079994e+17|2016-05-06 03:51:05|86.0|23.84|Bokaro|Jharkhand| | | |


In [12]:
FinalData = pd.merge(combinedData, data2, on='device_id',how='left')
FinalData.head()

| |event_id|device_id|timestamp|longitude|latitude|city|state|gender|age|group|phone_brand|device_model|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|2765368|2.9733477869949143e+18|2016-05-07 22:52:05|77.23|28.73|Delhi|Delhi|M|35.0|M32-38|优米|UIMI3|
|1|2955066|4.734221357723753e+18|2016-05-01 20:44:16|88.39|22.66|Calcutta|WestBengal| | | | | |
|2|605968|-3.264499652692493e+18|2016-05-02 14:23:04|77.26|28.76|Delhi|Delhi| | | | | |
|3|448114|5.731369272434022e+18|2016-05-03 13:21:16|80.34|13.15|Chennai|TamilNadu| | | | | |
|4|665740|3.3888800257079994e+17|2016-05-06 03:51:05|86.0|23.84|Bokaro|Jharkhand| | | | | |


In [13]:
FinalData.describe()

| |event_id|device_id|longitude|latitude|age|
|:--|:--|:--|:--|:--|:--|
|count|3252950.0|3252497.0|3252527.0|3252527.0|16982.0|
|mean|1626475.5|1.0122000958550902e+17|78.16|21.69|32.43|
|std|939045.92|5.316758188197051e+18|4.24|5.79|9.16|
|min|1.0|-9.222956879900151e+18|12.57|8.19|10.0|
|25%|813238.25|-4.540611333857475e+18|75.84|17.8|26.0|
|50%|1626475.5|1.726820111592788e+17|77.27|22.16|30.0|
|75%|2439712.75|4.861813234983624e+18|80.32|28.68|36.0|
|max|3252950.0|9.222849349208141e+18|95.46|41.87|79.0|


In [14]:
FinalData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3252950 entries, 0 to 3252949
Data columns (total 12 columns):
 #   Column        Dtype  
---  ------        -----  
 0   event_id      int64  
 1   device_id     float64
 2   timestamp     object 
 3   longitude     float64
 4   latitude      float64
 5   city          object 
 6   state         object 
 7   gender        object 
 8   age           float64
 9   group         object 
 10  phone_brand   object 
 11  device_model  object 
dtypes: float64(4), int64(1), object(7)
memory usage: 322.6+ MB


### **Data Information**

In [15]:
FinalData.isna().sum()

event_id              0
device_id           453
timestamp             0
longitude           423
latitude            423
city                  0
state               377
gender          3235968
age             3235968
group           3235968
phone_brand     3235968
device_model    3235968
dtype: int64
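
The near-total absence of demographic and handset fields after the two merges may partly stem from the join key: `device_id` is `float64` in the events frame but `int64` in the other two, and 19-digit identifiers cannot be stored exactly as `float64`. The sketch below uses a hypothetical three-row file (`events_csv`) and a hypothetical `demo` frame, neither taken from the real data, to show the effect and one possible workaround: reading the key as text. Whether this explains the gaps in the actual files would need to be checked against them.

```python
import io
import pandas as pd

# Hypothetical miniature of the events file. The blank device_id in row 3
# is what makes pandas fall back to float64 for the whole column.
events_csv = io.StringIO(
    "event_id,device_id,city\n"
    "1,1877775838486905855,Delhi\n"
    "2,1877775838486905856,Chennai\n"
    "3,,Bokaro\n"
)

# Hypothetical miniature of the demographics file (ids kept as int64).
demo = pd.DataFrame({
    "device_id": [1877775838486905855, 1877775838486905856],
    "gender": ["M", "F"],
})

# Default parsing: two distinct ids collapse into one float64 value.
events_float = pd.read_csv(events_csv)
collapsed = events_float["device_id"].dropna().nunique()
print("dtype:", events_float["device_id"].dtype, "| distinct ids:", collapsed)

# Reading the key as text keeps every digit, so the join can match exactly.
events_csv.seek(0)
events_text = pd.read_csv(events_csv, dtype={"device_id": str})
demo_text = demo.assign(device_id=demo["device_id"].astype(str))

merged = events_text.merge(demo_text, on="device_id", how="left", indicator=True)
matched = int((merged["_merge"] == "both").sum())
print("rows matched:", matched, "of", len(merged))
```

The `indicator=True` argument adds a `_merge` column, which is a quick way to count how many rows found a partner on the right-hand side.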

In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3252950 entries, 0 to 3252949
Data columns (total 7 columns):
 #   Column     Dtype  
---  ------     -----  
 0   event_id   int64  
 1   device_id  float64
 2   timestamp  object 
 3   longitude  float64
 4   latitude   float64
 5   city       object 
 6   state      object 
dtypes: float64(3), int64(1), object(3)
memory usage: 173.7+ MB


In [55]:
data.describe(include = 'all')

| |car|price|body|mileage|engV|engType|registration|year|model|drive|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|count|9576|9576.0|9576|9576.0|9142.0|9576|9576|9576.0|9576|9065|
|unique|87| |6| | |4|2| |888|3|
|top|Volkswagen| |sedan| | |Petrol|yes| |E-Class|front|
|freq|936| |3646| | |4379|9015| |199|5188|
|mean| |15633.32| |138.86|2.65| | |2006.61| | |
|std| |24106.52| |98.63|5.93| | |7.07| | |
|min| |0.0| |0.0|0.1| | |1953.0| | |
|25%| |4999.0| |70.0|1.6| | |2004.0| | |
|50%| |9200.0| |128.0|2.0| | |2008.0| | |
|75%| |16700.0| |194.0|2.5| | |2012.0| | |


In [56]:
data['year'].value_counts()

2008    1158
2007     930
2012     767
2011     701
2013     651
2006     564
2016     459
2005     413
2010     389
2014     368
2009     347
2004     338
2003     282
2015     249
2000     231
2002     219
2001     216
1999     160
1998     152
1996     125
1997     123
1995      85
1994      76
1991      70
1990      64
1992      62
1988      60
1993      53
1989      50
1986      45
1987      43
1985      28
1984      15
1979      10
1982      10
1980       9
1981       8
1983       7
1978       7
1977       6
1976       4
1969       3
1974       2
1961       2
1962       2
1971       2
1963       2
1958       1
1953       1
1973       1
1972       1
1959       1
1975       1
1964       1
1970       1
1968       1
Name: year, dtype: int64

---
<a name = Section5></a>
# **5. Data Pre-Profiling**
---

- This section focuses on generating a report about the data.

- You need to run pandas profiling and draw some observations from it...

In [None]:
# Insert your code here...
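
As a lightweight stand-in for a full profiling report, a per-column overview can be assembled with pandas alone. The sketch below is not the pandas-profiling output; `quick_profile` and the `sample` frame are hypothetical names introduced for illustration, with column names borrowed from the merged frame.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count and share, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical sample, not real data.
sample = pd.DataFrame({
    "state": ["Delhi", "WestBengal", None, "Delhi"],
    "age": [35.0, None, None, 29.0],
    "phone_brand": ["vivo", "OPPO", "vivo", None],
})

report = quick_profile(sample)
print(report)
```

Running the same function before and after pre-processing gives a simple way to compare the two states of the data.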

---
<a name = Section6></a>
# **6. Data Pre-Processing**
---

- This section covers manipulating unstructured data for further processing and analysis.

- To turn unstructured data into structured data, you need to verify and restore the integrity of the data by:
  - Handling missing data,

  - Handling redundant data,

  - Handling inconsistent data,

  - Handling outliers,

  - Handling typos
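
One instance of inconsistent data in the frames above is the `timestamp` column, which is stored as `object` (text) rather than as a datetime. A minimal sketch of one way to convert it follows, assuming the `YYYY-MM-DD HH:MM:SS` format seen in the output; the `events` frame and its malformed third row are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the format shown by the events data;
# the last value is deliberately malformed.
events = pd.DataFrame({
    "event_id": [2765368, 2955066, 605968],
    "timestamp": ["2016-05-07 22:52:05", "2016-05-01 20:44:16", "not a date"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
events["timestamp"] = pd.to_datetime(
    events["timestamp"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Derived features that are often useful for usage analysis.
events["hour"] = events["timestamp"].dt.hour
events["weekday"] = events["timestamp"].dt.day_name()

print(events)
print("unparseable timestamps:", int(events["timestamp"].isna().sum()))
```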

In [58]:
null_frame = pd.DataFrame(index = data.columns.values)

In [59]:
null_frame['Null Frequency'] = data.isnull().sum().values
null_frame

| |Null Frequency|
|:--|:--|
|car|0|
|price|0|
|body|0|
|mileage|0|
|engV|434|
|engType|0|
|registration|0|
|year|0|
|model|0|
|drive|511|


In [60]:
percent = data.isnull().sum().values/data.shape[0]

In [61]:
null_frame['Missing %age'] = np.round(percent, decimals = 4) * 100
null_frame

| |Null Frequency|Missing %age|
|:--|:--|:--|
|car|0|0.0|
|price|0|0.0|
|body|0|0.0|
|mileage|0|0.0|
|engV|434|4.53|
|engType|0|0.0|
|registration|0|0.0|
|year|0|0.0|
|model|0|0.0|
|drive|511|5.34|


In [72]:
data.duplicated().any()

True
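
Since `duplicated().any()` returns `True`, a natural next step is to count the repeated rows and drop them. The sketch below shows one way to do that on a hypothetical `cars` frame; whether exact duplicates in the real data are errors or genuinely separate listings is a judgment call that the data alone cannot settle.

```python
import pandas as pd

# Hypothetical listings; rows 0 and 2 are exact copies.
cars = pd.DataFrame({
    "car": ["Acura", "Acura", "Acura"],
    "price": [8699.0, 8700.0, 8699.0],
    "mileage": [144, 145, 144],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(cars.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index.
deduped = cars.drop_duplicates(keep="first").reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s) removed, {len(deduped)} remain")
```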

In [79]:
data.sort_values(by=['car', 'price','body','mileage'])

| |car|price|body|mileage|engV|engType|registration|year|model|drive|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|7194|Acura|8550.0|sedan|145|3.2|Petrol|yes|2005|TL|front|
|4803|Acura|8699.0|sedan|144|3.2|Gas|yes|2005|TL|front|
|7390|Acura|8699.0|sedan|145|3.2|Gas|yes|2005|TL|front|
|8082|Acura|8700.0|sedan|145|3.2|Gas|yes|2005|TL|front|
|5452|Acura|11111.0|crossover|199|3.5|Petrol|yes|2005|MDX|full|
|5835|Acura|12900.0|sedan|126|3.5|Gas|yes|2006|RL|full|
|2198|Acura|15000.0|sedan|150|3.5|Gas|yes|2008|RL|full|
|4484|Acura|15650.0|crossover|170|3.7|Petrol|yes|2008|MDX|full|
|6638|Acura|17200.0|crossover|82|3.7|Petrol|yes|2008|MDX|full|
|9063|Acura|18500.0|crossover|85|3.7|Petrol|yes|2008|MDX|full|


In [69]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9576 entries, 0 to 9575
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   car           9576 non-null   object 
 1   price         9576 non-null   float64
 2   body          9576 non-null   object 
 3   mileage       9576 non-null   int64  
 4   engV          9142 non-null   float64
 5   engType       9576 non-null   object 
 6   registration  9576 non-null   object 
 7   year          9576 non-null   int64  
 8   model         9576 non-null   object 
 9   drive         9065 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 748.2+ KB
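
The `info()` output above still shows gaps in `engV` and `drive`. One common option, sketched below on a hypothetical `cars` frame, is to fill the numeric column with its median and the categorical column with its most frequent value. This is an assumption about a reasonable strategy, not the notebook's prescribed method; dropping the rows or imputing per car model are equally valid alternatives.

```python
import pandas as pd

# Hypothetical rows using the column names from the frame above.
cars = pd.DataFrame({
    "engV": [3.2, None, 2.0, 1.6],
    "drive": ["front", "full", None, "front"],
})

# Numeric gap: the median is less sensitive to extreme values than the mean.
cars["engV"] = cars["engV"].fillna(cars["engV"].median())

# Categorical gap: fall back to the most frequent category.
cars["drive"] = cars["drive"].fillna(cars["drive"].mode().iloc[0])

print(cars)
```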


---
<a name = Section7></a>
# **7. Data Post-Profiling**
---

- This section focuses on generating a report about the data after manipulation.

- You may observe some new changes, so keep them in check and make the right observations.

In [None]:
# Insert your code here...

---
<a name = Section8></a>
# **8. Exploratory Data Analysis**
---

- This section focuses on asking the right questions and performing analysis using the data.

- Note that there is no limit to how deep you can go, but make sure not to stray from the right track.

In [None]:
# Insert your code here...
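
As a starting point, each question can be phrased as a single aggregation. The sketch below answers two illustrative questions on a hypothetical `events` frame whose column names are taken from the merged frame; the values are invented for the example.

```python
import pandas as pd

# Hypothetical rows, not real data.
events = pd.DataFrame({
    "state": ["Delhi", "Delhi", "TamilNadu", "WestBengal", "Delhi"],
    "gender": ["M", "F", "M", "M", "M"],
    "age": [35, 28, 41, 30, 24],
})

# Question 1: which states generate the most events?
events_per_state = events["state"].value_counts()

# Question 2: how does the typical user age differ by gender?
median_age = events.groupby("gender")["age"].median()

print(events_per_state)
print(median_age)
```

Both results are pandas Series, so they can be passed directly to a plotting call such as `events_per_state.plot(kind="bar")` once the numbers look sensible.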

---
<a name = Section9></a>
# **9. Summarization**
---

<a name = Section91></a>
### **9.1 Conclusion**

- In this part you need to provide a conclusion about your overall analysis.

- Write down some short points that you have observed so far.

<a name = Section92></a>
### **9.2 Actionable Insights**

- This is a very crucial part where you will present your actionable insights.
- You need to give suggestions about what could be applied and what should not.
- Make sure that these suggestions are short and to the point; ultimately, they are a catalyst for your business.