# Train Passenger Volume Prediction

<img src="https://wallpaperplay.com/walls/full/6/2/0/159951.jpg" height='400px' width='100%'><br/>





## 1. Problem Statement
  ### - 1.1 Introduction
  ### - 1.2 Business Goal
## 2. Importing Packages
## 3. Loading Data
  ### - 3.1 Importing Data
  ### - 3.2 Description of the Datasets
  
## 4. Data Processing
  ### - 4.1 Pandas Profiling before Data Preprocessing
  ### - 4.2 Data cleaning
  ### - 4.3 Pandas Profiling after Data Preprocessing
## 5. Exploratory Data Analysis
  
## 6. Creating model and prediction.


# 1. Problem Statement

## 1.1 Introduction



<img src="ProblemStatement.png" height='400px' width='100%'><br/>

## 1.2 Business Goal

To predict the volume of passengers on a train based on historical data.

# 2. Importing packages

In [0]:
import numpy as np                     

import pandas as pd
pd.set_option('mode.chained_assignment', None)      # To suppress pandas chained-assignment warnings.
pd.set_option('display.max_colwidth', None)         # To display all the data in each column (pandas >= 1.0; older versions used -1).
pd.set_option('display.max_rows', 10000)            # To display up to 10000 rows.
pd.options.display.max_columns = 50                 # To display up to 50 columns of the dataset in head()

import warnings
warnings.filterwarnings('ignore')     

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style='whitegrid', font_scale=1.3, color_codes=True)      # To apply seaborn styles to the plots.

# 3. Loading Data



## 3.1 Importing data


In [0]:
df_train_pass_volume = pd.read_csv('Train.csv', index_col = "id_code")

In [0]:
df_train_pass_volume.head()

id_code,current_date,current_time,source_name,destination_name,train_name,target,country_code_source,longitude_source,latitude_source,mean_halt_times_source,country_code_destination,longitude_destination,latitude_destination,mean_halt_times_destination,current_year,current_week,current_day,is_weekend
isfywypmkqqhyft,2016-07-27,08:05:51 PM,station$147,station$1,ICZVZS,high,whber,4.356801,50.845658,634.16474,,,,,2016,30,Wednesday,False
mqsfxyvuqpbwomk,2016-07-27,08:06:11 PM,station$147,station$1,ICZVZS,high,whber,4.356801,50.845658,634.16474,,,,,2016,30,Wednesday,False
alspwwtbdvqsgby,2016-07-27,08:08:57 PM,station$147,station$1,ICZVZS,high,whber,4.356801,50.845658,634.16474,,,,,2016,30,Wednesday,False
szitxhhqduyrqpg,2016-07-27,08:09:08 PM,station$147,station$1,ICZVZS,high,whber,4.356801,50.845658,634.16474,,,,,2016,30,Wednesday,False
krisdqzczivvwcp,2016-07-27,08:11:01 PM,station$147,station$1,ICZVZS,high,whber,4.356801,50.845658,634.16474,,,,,2016,30,Wednesday,False


## 3.2 Description of the Datasets

In [0]:
df_train_pass_volume.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1284 entries, isfywypmkqqhyft to hfhwirltuffenfr
Data columns (total 18 columns):
current_date                   1284 non-null object
current_time                   1284 non-null object
source_name                    1284 non-null object
destination_name               1284 non-null object
train_name                     1284 non-null object
target                         1284 non-null object
country_code_source            1283 non-null object
longitude_source               1283 non-null float64
latitude_source                1283 non-null float64
mean_halt_times_source         1283 non-null float64
country_code_destination       1251 non-null object
longitude_destination          1251 non-null float64
latitude_destination           1251 non-null float64
mean_halt_times_destination    1251 non-null float64
current_year                   1284 non-null int64
current_week                   1284 non-null int64
current_day                    1284 n

**Observations :**

We observe that a few columns have missing values. We will fill them in during the data processing step.
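
The missing values noted above can be counted per column with `isna().sum()`. The snippet below is a minimal sketch: the small `demo` frame and its values are made up for illustration, and in this notebook the same calls would be applied to `df_train_pass_volume`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train_pass_volume with a few gaps.
demo = pd.DataFrame({
    "country_code_source": ["whber", None, "whber", "whber"],
    "longitude_destination": [4.35, np.nan, np.nan, 4.48],
    "current_week": [30, 30, 31, 31],
})

# Count missing entries per column and keep only the columns that have any.
missing_counts = demo.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Sorting the counts in descending order puts the columns that need the most attention first.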

In [0]:
df_train_pass_volume.describe()

statistic,longitude_source,latitude_source,mean_halt_times_source,longitude_destination,latitude_destination,mean_halt_times_destination,current_year,current_week
count,1283.0,1283.0,1283.0,1251.0,1251.0,1251.0,1284.0,1284.0
mean,4.292481,50.934674,278.061613,4.298829,50.92457,271.872701,2016.0,36.781153
std,0.552492,0.206194,228.954089,0.558849,0.296266,234.419223,0.0,3.175253
min,-0.126061,49.638463,0.0,0.32107,43.455128,0.0,2016.0,30.0
25%,4.039653,50.845658,78.488439,4.014573,50.835707,71.193642,2016.0,36.0
50%,4.360846,50.896456,180.598266,4.356801,50.891925,164.419075,2016.0,38.0
75%,4.482785,51.056365,467.982659,4.482785,51.035896,421.644509,2016.0,39.0
max,5.982265,51.925093,686.615607,6.958823,52.379128,686.615607,2016.0,40.0


**Observations:**

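Most values lie in a narrow range. However, the minimum of **latitude_destination** (43.455) and the minimum of **longitude_source** (-0.126) sit far from their means (50.925 and 4.292), so a few stations are geographic outliers.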
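
An observation about outliers can be checked numerically rather than by eye. The sketch below uses the common 1.5 x IQR rule, which is an assumption here and not something the notebook itself applies; the helper `iqr_outlier_mask` and the sample latitudes are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical latitude values; the last one sits far from the rest.
latitudes = pd.Series([50.84, 50.85, 50.89, 50.90, 51.03, 51.05, 43.46])
mask = iqr_outlier_mask(latitudes)
print(latitudes[mask])
```

Applied to a column such as `latitude_destination`, the mask would list the rows worth inspecting before modelling.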

# 4. Data Processing

## 4.1 Pandas Profiling before Data Preprocessing

In [0]:
# To install pandas profiling please run this command.

!pip install pandas-profiling --upgrade

In [0]:
import pandas_profiling

In [0]:
# Running pandas profiling to get better understanding of data
df_train_pass_volume.profile_report(title='Pandas Profiling before Data Preprocessing', style={'full_width':True})



## 4.2  Data cleaning

### 4.2.1 Converting current_date and current_time to pandas datetime







In [0]:
# concatenating current_date and current_time to form datetime string
df_train_pass_volume['current_date_time'] = df_train_pass_volume['current_date'].astype('str')  + " " + df_train_pass_volume['current_time'].astype('str')

# converting datetime string to pandas datetime
df_train_pass_volume['current_date_time'] = pd.to_datetime( df_train_pass_volume['current_date_time'])

# dropping current_date and current_time columns
df_train_pass_volume.drop(['current_date','current_time'], axis = 1, inplace=True)
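
The conversion above lets pandas infer the timestamp layout. A minimal sketch of the same step with an explicit format string is shown below; it assumes the `YYYY-MM-DD` date and 12-hour `HH:MM:SS AM/PM` time layout seen in the `head()` output, and the `demo` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical rows using the same date and time layout as the dataset.
demo = pd.DataFrame({
    "current_date": ["2016-07-27", "2016-07-27"],
    "current_time": ["08:05:51 PM", "08:06:11 AM"],
})

# Join date and time, then parse with an explicit 12-hour clock format.
combined = demo["current_date"] + " " + demo["current_time"]
demo["current_date_time"] = pd.to_datetime(combined, format="%Y-%m-%d %I:%M:%S %p")
print(demo["current_date_time"])
```

Passing `format` makes the parse fail loudly on unexpected strings instead of guessing, and it is usually faster on large columns.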

### 4.2.2 Handling missing values in each column


**a.  country_code_destination**

In [0]:
# checking for rows in which country_code_destination is empty
df_train_pass_volume[df_train_pass_volume['country_code_destination'].isna()].shape

(33, 17)

**NOTE**:

There are 33 rows in which **country_code_destination** is empty. Let's fill them with the mode (most frequent value) of the **country_code_destination** column, since a categorical column has no median.





In [0]:
# calculating mode (most frequent value) of country_code_destination
country_code_destination_mode = df_train_pass_volume['country_code_destination'].mode()[0]

# filling missing column values with the mode
df_train_pass_volume['country_code_destination'] = df_train_pass_volume['country_code_destination'].fillna(country_code_destination_mode)

In [0]:
df_train_pass_volume[df_train_pass_volume['country_code_destination'].isna()].shape

(0, 17)

**b.  country_code_source**

In [0]:
# checking for rows in which country_code_source is empty
df_train_pass_volume[df_train_pass_volume['country_code_source'].isna()].shape

(1, 17)

**NOTE**:

There is only one row in which **country_code_source** is empty. Let's fill it with the mode (most frequent value) of the **country_code_source** column.






In [0]:
# calculating mode (most frequent value) of country_code_source
country_code_source_mode = df_train_pass_volume['country_code_source'].mode()[0]

# filling missing column values with the mode
df_train_pass_volume['country_code_source'] = df_train_pass_volume['country_code_source'].fillna(country_code_source_mode)

In [0]:
df_train_pass_volume[df_train_pass_volume['country_code_source'].isna()].shape

(0, 17)

**c.  latitude_destination**

In [0]:
# checking for rows in which latitude_destination is empty
df_train_pass_volume[df_train_pass_volume['latitude_destination'].isna()].shape

(33, 17)

**NOTE**:

There are 33 rows in which **latitude_destination** is empty. Let's fill them with the mode (most frequent value) of the **latitude_destination** column.



In [0]:
# calculating mode (most frequent value) of latitude_destination
latitude_destination_mode = df_train_pass_volume['latitude_destination'].mode()[0]

# filling missing column values with the mode
df_train_pass_volume['latitude_destination'] = df_train_pass_volume['latitude_destination'].fillna(latitude_destination_mode)

In [0]:
df_train_pass_volume[df_train_pass_volume['latitude_destination'].isna()].shape

(0, 17)

**d.  longitude_destination**

In [0]:
# checking for rows in which longitude_destination is empty
df_train_pass_volume[df_train_pass_volume['longitude_destination'].isna()].shape

(33, 17)

**NOTE**:

There are 33 rows in which **longitude_destination** is empty. Let's fill them with the mode (most frequent value) of the **longitude_destination** column.



In [0]:
# calculating mode (most frequent value) of longitude_destination
longitude_destination_mode = df_train_pass_volume['longitude_destination'].mode()[0]

# filling missing column values with the mode
df_train_pass_volume['longitude_destination'] = df_train_pass_volume['longitude_destination'].fillna(longitude_destination_mode)

In [0]:
df_train_pass_volume[df_train_pass_volume['longitude_destination'].isna()].shape

(0, 17)

**e.  mean_halt_times_destination**

In [0]:
# checking for rows in which mean_halt_times_destination is empty
df_train_pass_volume[df_train_pass_volume['mean_halt_times_destination'].isna()].shape

(33, 17)

In [0]:
# calculating mode (most frequent value) of mean_halt_times_destination
mean_halt_times_destination_mode = df_train_pass_volume['mean_halt_times_destination'].mode()[0]

# filling missing column values with the mode
df_train_pass_volume['mean_halt_times_destination'] = df_train_pass_volume['mean_halt_times_destination'].fillna(mean_halt_times_destination_mode)

In [0]:
df_train_pass_volume[df_train_pass_volume['mean_halt_times_destination'].isna()].shape

(0, 17)

**f.  mean_halt_times_source**

In [0]:
# checking for rows in which mean_halt_times_source is empty
df_train_pass_volume[df_train_pass_volume['mean_halt_times_source'].isna()].shape

(1, 17)

In [0]:
# calculating mode (most frequent value) of mean_halt_times_source
mean_halt_times_source_mode = df_train_pass_volume['mean_halt_times_source'].mode()[0]

# filling missing column values with the mode
df_train_pass_volume['mean_halt_times_source'] = df_train_pass_volume['mean_halt_times_source'].fillna(mean_halt_times_source_mode)

In [0]:
df_train_pass_volume[df_train_pass_volume['mean_halt_times_source'].isna()].shape

(0, 17)
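
The fill step above is repeated once per column with the same `mode()[0]` pattern. A minimal sketch of the same idea as a loop follows; the helper name `fill_with_mode` and the `demo` frame are hypothetical, and the function returns a copy rather than modifying the frame in place.

```python
import numpy as np
import pandas as pd

def fill_with_mode(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by each column's mode."""
    filled = df.copy()
    for col in columns:
        modes = filled[col].mode()   # mode() ignores NaN by default
        if not modes.empty:          # skip columns that are entirely NaN
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled

# Hypothetical frame with one categorical and one numeric column.
demo = pd.DataFrame({
    "country_code_destination": ["whber", "whber", None, "abcde"],
    "mean_halt_times_destination": [634.2, np.nan, 634.2, 71.2],
})
cleaned = fill_with_mode(demo, demo.columns)
print(cleaned.isna().sum())
```

For continuous columns such as coordinates or halt times, the median is a common alternative to the mode; swapping `mode().iloc[0]` for `median()` on numeric columns would be a small change to this sketch.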

## 4.3 Pandas Profiling after Data Preprocessing

In [0]:
# Re-running pandas profiling to confirm that no missing values remain
df_train_pass_volume.profile_report(title='Pandas Profiling after Data Preprocessing', style={'full_width':True})

