<a href="https://colab.research.google.com/github/renegade620/G2M-insight-for-Cab-Investment-firm/blob/main/eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **G2M Insight for Cab Investment Firm (EDA)**


## **Introduction**

XYZ is a private firm in the US. Given the remarkable growth of the cab industry over the last few years and the presence of multiple key players in the market, XYZ is planning to invest in the industry. In line with its Go-to-Market (G2M) strategy, the firm wants to understand the market before making a final decision.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## **Data Exploration**

### **Cab Data**

In [49]:
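# load the cab trip records
cab_data = pd.read_csv('https://github.com/renegade620/G2M-insight-for-Cab-Investment-firm/raw/main/Datasets/Cab_Data.csv')
# convert the integer day counts in 'Date of Travel' to datetimes, counting days from 1900-01-01
cab_data['Date of Travel'] = pd.to_datetime(cab_data['Date of Travel'], unit='D', origin=pd.Timestamp('1900-01-01'))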
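
The choice of origin determines every resulting date, so it may be worth checking against the convention of whatever tool produced the file. The sketch below compares the origin used above with `1899-12-30`, the origin conventionally used for Excel serial dates. Whether the column was exported from Excel is an assumption, and the raw day counts in `raw_days` are hypothetical stand-ins for the unconverted column.

```python
import pandas as pd

# Hypothetical raw day counts, standing in for the unconverted 'Date of Travel' column.
raw_days = pd.Series([42371, 42377, 43465], name="Date of Travel")

# Origin used in the notebook cell.
from_1900 = pd.to_datetime(raw_days, unit="D", origin=pd.Timestamp("1900-01-01"))

# Origin conventionally used for Excel serial dates
# (assumption: the CSV was exported from a spreadsheet).
from_excel = pd.to_datetime(raw_days, unit="D", origin=pd.Timestamp("1899-12-30"))

# Side-by-side view of the two interpretations.
comparison = pd.DataFrame(
    {"origin_1900_01_01": from_1900, "origin_1899_12_30": from_excel}
)

# Constant gap, in days, between the two interpretations.
offset_days = (from_1900 - from_excel).dt.days

print(comparison)
print(offset_days.unique())
```

If the data does follow the Excel convention, every date produced with the `1900-01-01` origin would be two days later than intended; confirming against a known trip date would settle which origin applies.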

In [51]:
cab_data.head(10)

|   | Transaction ID | Date of Travel | Company | City | KM Travelled | Price Charged | Cost of Trip |
|---|---|---|---|---|---|---|---|
| 0 | 10000011 | 2016-01-10 | Pink Cab | ATLANTA GA | 30.45 | 370.95 | 313.635 |
| 1 | 10000012 | 2016-01-08 | Pink Cab | ATLANTA GA | 28.62 | 358.52 | 334.854 |
| 2 | 10000013 | 2016-01-04 | Pink Cab | ATLANTA GA | 9.04 | 125.2 | 97.632 |
| 3 | 10000014 | 2016-01-09 | Pink Cab | ATLANTA GA | 33.17 | 377.4 | 351.602 |
| 4 | 10000015 | 2016-01-05 | Pink Cab | ATLANTA GA | 8.73 | 114.62 | 97.776 |
| 5 | 10000016 | 2016-01-09 | Pink Cab | ATLANTA GA | 6.06 | 72.43 | 63.024 |
| 6 | 10000017 | 2016-01-05 | Pink Cab | AUSTIN TX | 44.0 | 576.15 | 475.2 |
| 7 | 10000018 | 2016-01-09 | Pink Cab | AUSTIN TX | 35.65 | 466.1 | 377.89 |
| 8 | 10000019 | 2016-01-14 | Pink Cab | BOSTON MA | 14.4 | 191.61 | 146.88 |
| 9 | 10000020 | 2016-01-08 | Pink Cab | BOSTON MA | 10.89 | 156.98 | 113.256 |


In [9]:
cab_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359392 entries, 0 to 359391
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Transaction ID  359392 non-null  int64  
 1   Date of Travel  359392 non-null  int64  
 2   Company         359392 non-null  object 
 3   City            359392 non-null  object 
 4   KM Travelled    359392 non-null  float64
 5   Price Charged   359392 non-null  float64
 6   Cost of Trip    359392 non-null  float64
dtypes: float64(3), int64(2), object(2)
memory usage: 19.2+ MB


In [11]:
cab_data.shape

(359392, 7)

In [14]:
cab_data[['KM Travelled', 'Price Charged', 'Cost of Trip']].describe()

|   | KM Travelled | Price Charged | Cost of Trip |
|---|---|---|---|
| count | 359392.0 | 359392.0 | 359392.0 |
| mean | 22.567254 | 423.443311 | 286.190113 |
| std | 12.233526 | 274.378911 | 157.993661 |
| min | 1.9 | 15.6 | 19.0 |
| 25% | 12.0 | 206.4375 | 151.2 |
| 50% | 22.44 | 386.36 | 282.48 |
| 75% | 32.96 | 583.66 | 413.6832 |
| max | 48.0 | 2048.03 | 691.2 |


In [15]:
# check for null values
cab_data.isnull().sum()

Transaction ID    0
Date of Travel    0
Company           0
City              0
KM Travelled      0
Price Charged     0
Cost of Trip      0
dtype: int64

In [16]:
# check for duplicated values
cab_data.duplicated().sum()

0
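
`duplicated()` with no arguments only flags rows that match on every column. Since *Transaction ID* is later used as a join key, it can also be useful to check that the key alone is unique. The following is a minimal sketch on a small hypothetical frame (the values are made up for illustration), not a result from the actual dataset.

```python
import pandas as pd

# Hypothetical sample: the last two rows share a Transaction ID but differ
# in price, so they are not full-row duplicates.
sample = pd.DataFrame(
    {
        "Transaction ID": [10000011, 10000012, 10000012],
        "Price Charged": [370.95, 358.52, 360.00],
    }
)

# Rows identical across all columns (what the cell above counts).
full_row_duplicates = int(sample.duplicated().sum())

# Rows whose key repeats an earlier key, regardless of the other columns.
key_duplicates = int(sample.duplicated(subset=["Transaction ID"]).sum())

# Quick boolean check on the key column.
key_is_unique = sample["Transaction ID"].is_unique

print(full_row_duplicates, key_duplicates, key_is_unique)
```

On the real data, the same two calls would be made on `cab_data` in place of `sample`.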

### **Customer Data**

In [None]:
customer_data = pd.read_csv('https://github.com/renegade620/G2M-insight-for-Cab-Investment-firm/raw/main/Datasets/Customer_ID.csv')

In [18]:
customer_data.head()

|   | Customer ID | Gender | Age | Income (USD/Month) |
|---|---|---|---|---|
| 0 | 29290 | Male | 28 | 10813 |
| 1 | 27703 | Male | 27 | 9237 |
| 2 | 28712 | Male | 53 | 11242 |
| 3 | 28020 | Male | 23 | 23327 |
| 4 | 27182 | Male | 33 | 8536 |


In [19]:
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49171 entries, 0 to 49170
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Customer ID         49171 non-null  int64 
 1   Gender              49171 non-null  object
 2   Age                 49171 non-null  int64 
 3   Income (USD/Month)  49171 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [20]:
customer_data.shape

(49171, 4)

In [24]:
customer_data[['Age', 'Income (USD/Month)']].describe()

|   | Age | Income (USD/Month) |
|---|---|---|
| count | 49171.0 | 49171.0 |
| mean | 35.363121 | 15015.631856 |
| std | 12.599066 | 8002.208253 |
| min | 18.0 | 2000.0 |
| 25% | 25.0 | 8289.5 |
| 50% | 33.0 | 14656.0 |
| 75% | 42.0 | 21035.0 |
| max | 65.0 | 35000.0 |


In [25]:
# check for null values
customer_data.isnull().sum()

Customer ID           0
Gender                0
Age                   0
Income (USD/Month)    0
dtype: int64

In [26]:
# check for duplicated values
customer_data.duplicated().sum()

0

### **Transaction Data**

In [29]:
transaction_data = pd.read_csv('https://github.com/renegade620/G2M-insight-for-Cab-Investment-firm/raw/main/Datasets/Transaction_ID.csv')

In [None]:
transaction_data.head()

In [None]:
transaction_data.info()

In [34]:
transaction_data.shape

(440098, 3)

In [35]:
transaction_data.isnull().sum()

Transaction ID    0
Customer ID       0
Payment_Mode      0
dtype: int64

In [36]:
transaction_data.duplicated().sum()

0

### **City Data**

In [None]:
city_data = pd.read_csv('https://github.com/renegade620/G2M-insight-for-Cab-Investment-firm/raw/main/Datasets/City.csv')


In [None]:
city_data.head()

In [None]:
city_data.info()

In [None]:
city_data.shape

In [None]:
city_data.isnull().sum()

In [None]:
city_data.duplicated().sum()
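
The same shape, null, and duplicate checks are repeated for each of the four files. One way to gather them in a single table is sketched below. The helper `quality_summary` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def quality_summary(frames):
    """Return one row per DataFrame with its shape, null count, and duplicate count."""
    rows = []
    for name, df in frames.items():
        rows.append(
            {
                "dataset": name,
                "rows": df.shape[0],
                "columns": df.shape[1],
                # total number of missing cells across all columns
                "null_values": int(df.isnull().sum().sum()),
                # rows that repeat an earlier row on every column
                "duplicate_rows": int(df.duplicated().sum()),
            }
        )
    return pd.DataFrame(rows).set_index("dataset")


# Hypothetical toy frames, used here so the sketch runs without the real files.
toy_frames = {
    "clean": pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}),
    "messy": pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "z"]}),
}

summary = quality_summary(toy_frames)
print(summary)
```

In the notebook, the call would take the loaded frames instead, for example `quality_summary({"cab": cab_data, "customer": customer_data, "transaction": transaction_data, "city": city_data})`.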

## **Relationships across the files**

**Cab data** relates to the **transaction data** through the *Transaction ID* column, and to the **city data** through the *City* column.

In [None]:
cab_data.head()
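
The shapes reported earlier differ (440098 transaction rows against 359392 cab rows), so some transactions presumably have no matching cab record. A hedged sketch of how that could be measured with a merge on *Transaction ID* follows. The frames are small hypothetical samples, and the `Payment_Mode` values and ID pairings are made up.

```python
import pandas as pd

# Hypothetical cab records.
cab = pd.DataFrame(
    {
        "Transaction ID": [10000011, 10000012],
        "Company": ["Pink Cab", "Pink Cab"],
    }
)

# Hypothetical transactions; the last one has no cab record.
transactions = pd.DataFrame(
    {
        "Transaction ID": [10000011, 10000012, 10000013],
        "Customer ID": [29290, 27703, 28712],
        "Payment_Mode": ["Card", "Card", "Cash"],  # illustrative values
    }
)

# indicator=True adds a '_merge' column saying where each row was found;
# validate raises if Transaction ID is not unique on both sides.
matched = transactions.merge(
    cab,
    on="Transaction ID",
    how="left",
    indicator=True,
    validate="one_to_one",
)

match_counts = matched["_merge"].value_counts()
print(match_counts)
```

Rows labelled `left_only` are transactions with no corresponding cab record.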


**Transaction data** relates to the **cab data** through the *Transaction ID* column, and to the **customer data** through the *Customer ID* column.

In [None]:
transaction_data.head()
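
Because one customer can appear in many transactions, the link through *Customer ID* is many-to-one. The sketch below shows one way to attach customer attributes to each transaction while asserting that cardinality. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical transactions: customer 29290 appears twice.
transactions = pd.DataFrame(
    {
        "Transaction ID": [1, 2, 3],
        "Customer ID": [29290, 29290, 27703],
    }
)

# Hypothetical customer attributes, one row per customer.
customers = pd.DataFrame(
    {
        "Customer ID": [29290, 27703],
        "Gender": ["Male", "Male"],
        "Age": [28, 27],
    }
)

# validate="many_to_one" raises a MergeError if a Customer ID
# repeats in the customers frame.
enriched = transactions.merge(
    customers,
    on="Customer ID",
    how="left",
    validate="many_to_one",
)

print(enriched)
```

A left merge keeps every transaction, so the row count of `enriched` equals that of `transactions`.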


**City data** relates to the **cab data** through the *City* column.

In [None]:
city_data.head()
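
Joining on a text column such as *City* only works when the spelling matches exactly in both files, so it may help to list cities that fail to match. The sketch below is a minimal illustration. The `Population` column and its values are assumptions, since the city file's columns are not shown here.

```python
import pandas as pd

# Hypothetical cab records spanning three cities.
cab = pd.DataFrame(
    {
        "Transaction ID": [10000011, 10000017, 10000019],
        "City": ["ATLANTA GA", "AUSTIN TX", "BOSTON MA"],
    }
)

# Hypothetical city table; 'Population' is an assumed column with made-up values,
# and BOSTON MA is deliberately missing.
city = pd.DataFrame(
    {
        "City": ["ATLANTA GA", "AUSTIN TX"],
        "Population": [100, 200],
    }
)

# Many cab rows map to one city row.
with_city = cab.merge(city, on="City", how="left", validate="many_to_one")

# Cities present in the cab data but absent from the city table.
unmatched_cities = sorted(
    with_city.loc[with_city["Population"].isna(), "City"].unique()
)

print(unmatched_cities)
```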


**Customer data** relates to the **transaction data** through the *Customer ID* column.

In [None]:
customer_data.head()