In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
sns.set_style('dark')
%matplotlib inline

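# Note: dirname and filename are left over from the os.walk loop in the first cell,
# so this reads the last file listed under /kaggle/input.
pf = pd.read_csv(os.path.join(dirname, filename))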
pf.head()

## Problem Understanding ##


The general questions formed with respect to the dataset are listed below.

## Question 1 ##

#### Which is the leading neighbouring country from which a huge number of people cross the border? ####

The US shares land borders with two countries: Mexico and Canada. There are limited opportunities for anyone to enter the country without legal permission by air or water, so the main concern is land. Here, the data is examined against the **Border** label to find which side of the country is more open to border crossing.

## Question 2 ##

#### What is the most preferred Measure in order to cross the border? ####

The US has a strong military defense organisation, yet a huge number of people still gain entry by crossing the border. It may be that some of the **measures** used to cross the border are neglected or not given enough importance. So, the distribution of values across measures may well be uneven.

## Question 3 ##

#### Which ports are more sensitive areas? ####

The main thing to do after crossing a border is to find **shelter**. So, the Port Names are examined against the values to see which are the most sensitive ones. This can also serve as a justification for Question 1.

## Question 4 ##

#### Is there any specific relation with the count of border crossing with the Date? ####

Climatic conditions are not uniform across the US. Harsh weather may reduce the number of crossings recorded, while favourable weather may raise the counts. So, the analysis keeps the climatic conditions in mind and checks whether there is any particular month in which the value is highest.

## Question 5 ##

#### How well can the Value be predicted? What aspects correlate well to the number of people crossing the border? ####

The final question is always: what is the current scenario and, based on the conditions so far, what might the future look like? This last question focuses on finding the features that are related to the border-crossing Value and can be used for future analysis.

# Section 2 : Data Understanding #

It is always essential to look at the type of data being used.

Starting with the dimensions of the input data:


In [None]:
(rows,column) = pf.shape

In [None]:
rows, column

Therefore, the dataset contains 355511 rows and 7 columns in total.

In [None]:
# The number of blank (null) cells in each column of the dataset

pf.isnull().sum()

In [None]:
pf.info()

## Question 1 - Which is the leading neighbouring country from which a huge number of people cross the border?##
As there is no blank data, we can go directly to our first question.

In [None]:
pf.groupby("Border")["Value"].sum().sort_values()

In [None]:
pf.groupby("Border")["Value"].sum().sort_values().plot(kind = 'bar');
plt.title("Which Border is more exposed?")

Hence, the bar chart clearly answers the 1st question: the <b> US-Mexico Border </b> is more vulnerable to people crossing the border.

## Question 2 - What is the most preferred Measure in order to cross the border?##

The importance of Measure in crossing the border

In [None]:
m_val = pf.Measure.value_counts()

(m_val/rows).plot(kind = 'bar');
plt.title("Which Measure occurs most often in the dataset?");

The exact totals of Value per Measure will be beneficial.

In [None]:
pf.groupby("Measure")["Value"].sum().sort_values()

In [None]:
pf.groupby("Measure")["Value"].sum().sort_values().plot(kind = 'bar');
plt.title("Which is the most used measure to cross the border?")

**Personal Vehicle Passengers** are the category with the highest number of border crossings.
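The per-Measure totals can also be expressed as shares of all crossings. The following is only a minimal sketch on a made-up toy frame (the name `toy` and its numbers are illustrative assumptions, not taken from the dataset); in the notebook, `pf` would take its place.

```python
import pandas as pd

# Toy data with made-up numbers; the real notebook would use pf instead.
toy = pd.DataFrame({
    "Measure": ["Trucks", "Pedestrians", "Trucks", "Buses"],
    "Value": [30, 50, 10, 10],
})

# Total Value per Measure, divided by the grand total to get a share.
totals = toy.groupby("Measure")["Value"].sum()
shares = (totals / totals.sum()).sort_values(ascending=False)
print(shares)
```

Selecting the `Value` column before calling `sum()` keeps the aggregation restricted to the one numeric column of interest.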

An analysis of border-wise Measures may help visualize the data further.

In [None]:
pf.groupby(["Measure","Border"])["Value"].sum().plot(kind = 'bar');
plt.title("Measures used in respective borders");

In [None]:
pf.groupby(["Measure","Border"])["Value"].sum()

Hence, it is clear that overall <b> Personal Vehicles </b> are used mostly to cross the border.

## Question 3 - Which ports are more sensitive areas?##

Which specific ports are sensitive?

Question 1 made clear which border is more sensitive. The question now is whether there is any particular state with the highest tendency for border crossing.

In [None]:
p_val = pf.State.value_counts()

(p_val/rows).plot(kind = 'bar');
plt.title("Which State occurs most often in the dataset?");

In [None]:
pf.groupby("State")["Value"].sum().sort_values()

In [None]:
pf.groupby("State")["Value"].sum().sort_values().plot(kind = 'bar');
plt.title("Which state is more exposed to people crossing the border ?")

Hence, it is clear that Texas has the highest tendency, followed by California and Arizona.

Now, let's see whether this can be related to the results of Question 1.

In [None]:
pf.groupby(["State","Border"])["Value"].sum().plot(kind = 'bar');
plt.title("States along respective borders");

The three most sensitive states (Texas, California, Arizona) are the ones lying along the US-Mexico border, which supports the findings of Question 1.

## Section 3 : Prepare Data ##

Now, we have established that a relation exists between Value and Border, Measure and State.

To find whether any particular time affects Value, we need to analyse Date against Value. Instead of working directly with the raw date strings, creating new columns for day, month and year helps analyse the data better. Let's check.

In [None]:
pf['Date'] = pd.to_datetime(pf['Date']) # convert the Date column to datetime so its components can be extracted
pf['Date'].head()


In [None]:
pf['year'] = pf['Date'].dt.year
pf['month'] = pf['Date'].dt.month
pf['day'] = pf['Date'].dt.day

In [None]:
pf.head()

In [None]:
# Select Value before summing: the datetime Date column cannot be summed
sum_crossing = pf.groupby("year")["Value"].sum().reset_index()
sum_crossing

## Question 4 - Is there any specific relation with the count of border crossing with the Date?##

Now the question arises about the relation between time and Value: is there any preferred time when people find it easier to cross the border?

In [None]:
plt.figure(figsize=(15,5))
plt.grid()
sns.set_style('dark')
chart = sns.barplot(x = 'year',y = 'Value',data=sum_crossing);
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
plt.title('Amount per year');

A further analysis of the year is done later.

Now, let's check the same for the month and day.

In [None]:
sum_month=pf.groupby('month')['Value'].sum().reset_index()

In [None]:
plt.figure(figsize=(15,10))
plt.grid()
sns.set_style('dark')
sns.barplot(x='month',y='Value',data=sum_month);

The monthly totals look consistent: they generally peak in the summer and autumn seasons and drop in the winter season. This may be due to the climatic conditions along the northern border.
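To make the seasonal comparison explicit, the monthly totals can be grouped into seasons. This is a hedged sketch: the `SEASONS` mapping (meteorological seasons for the Northern Hemisphere) and the toy numbers are assumptions introduced here for illustration, and `toy_month` stands in for the notebook's `sum_month`.

```python
import pandas as pd

# Hypothetical month-to-season mapping (meteorological, Northern Hemisphere).
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}

# Toy monthly totals with made-up numbers; the notebook would use sum_month.
toy_month = pd.DataFrame({
    "month": list(range(1, 13)),
    "Value": [5, 5, 6, 7, 8, 9, 10, 10, 9, 8, 6, 5],
})

# Label each month with its season, then total Value per season.
toy_month["season"] = toy_month["month"].map(SEASONS)
season_totals = toy_month.groupby("season")["Value"].sum()
print(season_totals.sort_values(ascending=False))
```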

In [None]:
sum_day=pf.groupby('day')['Value'].sum().reset_index()

In [None]:
plt.figure(figsize=(15,5))
plt.grid()
sns.set_style('dark')
sns.barplot(x='day',y='Value',data=sum_day)
plt.title('Amount per day')

In this dataset, every date falls on day 1 of the month, so the day column is constant. A constant feature carries no information for a Linear Regression model (its effect is absorbed by the intercept), so it is better to drop it.

Similarly, since the components of the date have been separated out, the Date column is fully determined by the day, month and year columns. Dropping it therefore loses no information.
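One way to confirm that a column is constant before dropping it is to count its distinct values. The sketch below uses a small made-up frame (`toy` and its dates are assumptions for illustration only), not the notebook's data.

```python
import pandas as pd

# Toy frame where every date falls on the first of the month.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
    "Value": [10, 20, 30],
})
toy["day"] = toy["Date"].dt.day

# A column with a single distinct value carries no information for a regression.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
print(constant_cols)

# Drop every constant column in one step.
toy = toy.drop(columns=constant_cols)
```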

In [None]:
pf.drop(columns='day',inplace=True)

In [None]:
pf.drop(columns='Date',inplace=True)
pf.head()

Now, some values of year look like they could serve for future predictions. Rather than discarding them outright, they could be used as test values for the final model that predicts the number of people crossing the border.

So, the first step is to find the latest year present in the data.

In [None]:
pf['year'].max() #finding the maximum year present

Due to the **COVID-19** outbreak, the data from year 2020 is incomplete and thus cannot be trusted for prediction.

Therefore, the year 2020 is excluded from both prediction and testing.
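Whether a year is incomplete can be checked by counting how many distinct months it covers. This is a minimal sketch on made-up data (the frame `toy` is an assumption for illustration); in the notebook the same grouping could be applied to `pf`.

```python
import pandas as pd

# Toy data: 2019 has all twelve months, 2020 only the first three.
toy = pd.DataFrame({
    "year": [2019] * 12 + [2020] * 3,
    "month": list(range(1, 13)) + [1, 2, 3],
})

# Count distinct months per year and flag years with fewer than twelve.
months_per_year = toy.groupby("year")["month"].nunique()
incomplete_years = months_per_year[months_per_year < 12].index.tolist()
print(incomplete_years)
```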

In [None]:
pf = pf[pf['year'] < 2020].copy() # copy so that new columns can be added later without a SettingWithCopyWarning

In [None]:
pf.describe()

In [None]:
pf.select_dtypes(include=['object'])

Some columns hold categorical data (restricted to a fixed set of values), while others hold continuous values that can take any value over their interval.

Categorical columns that are already numeric can be fed to the model directly, but **object** (string) columns cannot. So, it is better to replace them with dummy variables.

The categorical variables are one-hot encoded by the function categorical_one_hot_encoder.

In [None]:
def categorical_one_hot_encoder(pf, d, col):
    '''
    One-hot encodes a categorical column and returns the dataframe with the new columns added

    Parameters : The categorical_one_hot_encoder function takes the following arguments
    pf - the DataFrame which contains the categorical values
    d - an iterable (e.g. a list) of the category labels to encode
    col - the name of the column to encode

    Returns:
    DataFrame - the dataframe with one 0/1 indicator column per label, named "<col>_<label>"

    '''
    # add one indicator column per label: 1 where the row matches the label, else 0
    for label in d:
        pf[str(col)+"_"+str(label)] = np.where(pf[col] == label, 1, 0)

    return pf

Using the function categorical_one_hot_encoder, the categorical variables present, i.e. Border, State and Measure, are encoded so that their relation with Value can be modelled.
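As a point of comparison, pandas ships a built-in one-hot encoder, `pd.get_dummies`. The sketch below is not the notebook's method, only an alternative shown on a made-up toy frame; note that, unlike the hand-written function, `get_dummies` with `columns=` drops the original column.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Border": ["US-Canada Border", "US-Mexico Border", "US-Canada Border"],
    "Value": [1, 2, 3],
})

# One 0/1 column per distinct label, prefixed with the column name.
encoded = pd.get_dummies(toy, columns=["Border"], dtype=int)
print(encoded.columns.tolist())
```

The resulting column names follow the same `<col>_<label>` pattern that the notebook's training parameter list expects.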

In [None]:
items_border = (['US-Canada Border', 'US-Mexico Border'])
pf = categorical_one_hot_encoder(pf, items_border, col = 'Border')

In [None]:
items_state=(['AK', 'ND', 'ME', 'CA', 'WA', 'MT', 'NY', 'OH', 'ID', 'NM', 'MN', 'VT', 'MI', 'AZ', 'TX'])
pf = categorical_one_hot_encoder(pf, items_state, col = 'State')

In [None]:
items_measure=(['Trains', 'Train Passengers', 'Buses', 'Rail Containers Empty', 'Rail Containers Full','Truck Containers Empty',
             'Bus Passengers', 'Truck Containers Full', 'Trucks', 'Pedestrians', 'Personal Vehicles',
            'Personal Vehicle Passengers'])
pf = categorical_one_hot_encoder(pf, items_measure, col = 'Measure')

## Question 5 - How well can the Value be predicted? What aspects correlate well to the number of people crossing the border?##

Now, the main question, is there any way to predict the values of Border-Crossing traffic?

To find the answer, it is best to observe how the data varies and the correlation between different features.

In [None]:
pf[['Port Code', 'Value', 'month', 'year']].hist();

Every port has a unique Port Code. So, instead of using the Port Name as a feature, the provided Port Code is used.

The histograms show how Port Code, Value, month and year are distributed.

In [None]:
plt.figure(figsize=(25,15))
# correlations are only defined for numeric columns, so restrict to those
sns.heatmap(pf.select_dtypes(include='number').corr(), annot=True, fmt=".2f");

The heatmap suggests a strong correlation of Value with the Border, Measure and State parameters, and a less significant but still visible relation between Port Code and Value.
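A large heatmap can be hard to read, so it may help to list only the correlations with Value, ranked by magnitude. This is a sketch on a made-up toy frame (all numbers and the name `toy` are assumptions for illustration); the same steps could be applied to `pf`.

```python
import pandas as pd

# Toy frame with made-up numbers; "Port Name" is a non-numeric column.
toy = pd.DataFrame({
    "Value": [10, 20, 30, 40],
    "Border_US-Mexico Border": [0, 0, 1, 1],
    "month": [4, 1, 3, 2],
    "Port Name": ["A", "B", "C", "D"],
})

# Keep numeric columns only, then take the Value column of the correlation matrix.
numeric = toy.select_dtypes(include="number")
corr_with_value = numeric.corr()["Value"].drop("Value")

# Rank features by absolute correlation with Value.
ranked = corr_with_value.abs().sort_values(ascending=False)
print(ranked)
```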

In [None]:
pf.info()

## Section 4 : Model Data ##

Now the important aspect is to predict the values by observing a specific trend.

A **Linear Regression** model from the sklearn package of Python is used to observe the characteristics and predict the trend. The **R2 score** is used to measure how well the model fits the data.
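For reference, the R2 score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The following sketch computes it by hand on made-up numbers; `r2_manual` is a hypothetical helper written for illustration, not part of sklearn.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared deviation of the true values from their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values: three perfect predictions and one that is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
score = r2_manual(y_true, y_pred)
print(score)
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean.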

In [None]:
pf.columns

The full dataset is divided into train and test sets so that overfitting can be detected and the model evaluated more fairly. The rows are shuffled randomly before splitting so that both sets are representative of the data.

Now, the input variables for training are stored in a list for further use.

In [None]:
training_params = ['Port Code', 'year', 'month', 'Border_US-Canada Border', 'Border_US-Mexico Border', 'State_AK', 'State_ND',
                   'State_ME', 'State_CA', 'State_WA', 'State_MT', 'State_NY', 'State_OH', 'State_ID', 'State_NM', 'State_MN',
                   'State_VT', 'State_MI', 'State_AZ', 'State_TX', 'Measure_Trains', 'Measure_Train Passengers',
                   'Measure_Buses', 'Measure_Rail Containers Empty', 'Measure_Rail Containers Full',
                   'Measure_Truck Containers Empty', 'Measure_Bus Passengers', 'Measure_Truck Containers Full',
                   'Measure_Trucks', 'Measure_Pedestrians', 'Measure_Personal Vehicles', 'Measure_Personal Vehicle Passengers']

In [None]:
X = pf[training_params]
y = pf['Value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [None]:
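# The normalize argument was removed from LinearRegression in scikit-learn 1.2;
# for ordinary least squares, scaling the features does not change the predictions.
lm = LinearRegression()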
lm.fit(X_train, y_train)

In [None]:
y_pred_test = lm.predict(X_test)
y_pred_train = lm.predict(X_train)
Score_test = r2_score(y_test, y_pred_test)
Score_train = r2_score(y_train, y_pred_train)
print(Score_train, Score_test)

The similar scores on the test and train sets suggest that the model is not overfitted.

## Section 5 : Deployment ##

The Jupyter Notebook is available on GitHub: https://github.com/deadshotsb/US-Border-Crossing-Analysis

For further explanation, please visit the blog on Medium: https://medium.com/p/420fc0abb1c3

Thank you for your time and support.