# Homework 2b - Feature Extraction and Regression

In this part of the homework we'll look at the same dataset in a completely different light. We'll move beyond simply analysing the data and instead try to make inferences from it: predictions of when the dam's target value of 1.5 Trillion rupees (the minimum estimate) will be reached. Use the same set-up as Part a.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

data = pd.read_pickle('./individual_contributions.pkl')
daily_totals = data.groupby(['Date']).sum()
daily_totals.head()


Date,Amount
2018-07-06,2402300.0
2018-07-09,1346261.0
2018-07-10,5374641.0
2018-07-11,24830020.0
2018-07-12,29174820.91


We'll be running a regression analysis on this data since the target variable, the funds collected, is a continuous variable. Before we are able to run any sort of regression we need to decide what features we should be using for our regression. Moreover, since we are running a regression it is important to also figure out what exactly our target variable should be. Should it be the **cumulative sum** of the amount collected **till** each day, or should it simply be the amount collected **on** each day? Whatever you decide, write code below to get that target variable. 

Hint: Using groupby on "Date" would be a good option.

In [2]:
# Code to calculate the target variable
cumulative_sum_by_date = daily_totals.cumsum()
print(cumulative_sum_by_date.tail(5))

                  Amount
Date                    
2018-10-01  4.229601e+09
2018-10-02  4.229857e+09
2018-10-03  4.283380e+09
2018-10-04  4.375781e+09
2018-10-05  4.413591e+09


## Part B: Feature Extraction (20)

Other than the target variable (Amount), you currently have 3 columns: Bank, Name and Date. Which do you think should be used as the independent variable in running the regression?

Ans: Date seems best suited as the independent variable because it carries a lot of information: the day of the month, the day of the week, the week and the month. All of these can be used as features in our linear regression. The other columns, Bank and depositor Name, do not provide much useful information.

One possible variable we could use is the Date variable, but it cannot be used directly since it is a 'Datetime' object. Read up more on Linear Regression on the [sklearn Documentation page](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) to learn what sort of independent variables must be sent to it.

There are many different ways you can extract the right features from just the datetime column. Some useful in-built functions include the sklearn library's [LabelEncoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder), the [OneHotEncoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) and the [OrdinalEncoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder).
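As a rough illustration of one of these encoders, the sketch below one-hot encodes the day of the week. The three sample dates are hypothetical stand-ins rather than the homework dataset, and fixing the categories to 0 through 6 is an assumption made so that every weekday gets a column even if it is absent from the sample.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample dates, used only for illustration
dates = pd.Series(pd.to_datetime(["2018-07-06", "2018-07-09", "2018-07-10"]))

# Day of the week as a single integer column (0 = Monday, 6 = Sunday)
day_of_week = dates.dt.dayofweek.to_numpy().reshape(-1, 1)

# Fix the categories so the output always has 7 columns
encoder = OneHotEncoder(categories=[list(range(7))])
one_hot = encoder.fit_transform(day_of_week).toarray()

print(one_hot.shape)
print(one_hot)
```

Each row contains a single 1 in the column of its weekday, which avoids implying that Sunday (6) is "larger" than Monday (0).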

You need to think deeply about what sort of variables can be extracted from simply "Date", and how they would be useful in trying to figure out how much money is being collected on any given day. One good way to go about it would be to try out the regression on many different features and see which one works better.
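A minimal sketch of a reusable feature-extraction helper is shown here. The function name `extract_date_features` and the `days_since_start` feature are assumptions introduced for illustration; they are one possible choice of features, not the expected answer.

```python
import pandas as pd

def extract_date_features(dates, start=None):
    """Build numeric features from a sequence of dates (hypothetical helper)."""
    dates = pd.Series(pd.to_datetime(dates)).reset_index(drop=True)
    if start is None:
        start = dates.min()  # measure elapsed time from the earliest date
    return pd.DataFrame({
        "days_since_start": (dates - pd.Timestamp(start)).dt.days,
        "day_of_week": dates.dt.dayofweek,  # 0 = Monday, 6 = Sunday
        "month": dates.dt.month,
    })

features = extract_date_features(["2018-07-06", "2018-07-09", "2018-07-10"])
print(features)
```

Passing the same `start` when building features for future dates keeps `days_since_start` consistent between the training data and any mock data-set.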

In [3]:
# Import the appropriate functions from sklearn #
# Extract the right features #
# An example of one feature that could be extracted is given below #
# This finds what day of the week it is from the datetime object where 0 is Monday and 6 is Sunday
data['Day_int'] = data['Date'].dt.dayofweek
data['Month'] = data['Date'].dt.month
data['Date_int'] = data['Date'].dt.date
unique_data = data[['Day_int', 'Month', 'Date_int']].drop_duplicates()
unique_data = unique_data.sort_values(by='Date_int')
unique_data = unique_data.reset_index(drop=True)

# Print the entire dataframe.head() with the extracted features at the end of this cell #
print(unique_data.head())

   Day_int  Month    Date_int
0        4      7  2018-07-06
1        0      7  2018-07-09
2        1      7  2018-07-10
3        2      7  2018-07-11
4        3      7  2018-07-12


## Part C: Regression and Evaluation (40)

From here onwards, how exactly you structure your code is up to you. The main goal is this: choose a regression model from one of the many [linear_models](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) available in sklearn. If you're feeling adventurous you can try using [Support Vector Regression](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) as well, but unless you take the time to understand how Support Vector Machines work, and why they might not be the best idea for such a dataset, it will not be a fruitful exercise.

You need to learn how to evaluate your model. Every sklearn regression model has a built-in function that can calculate the regression score for you (as done before), and the sklearn [Mean Squared Error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) function will be used to calculate the error on your test-set and your train-set. In most cases you will use either a custom function to split the dataset into a train-test set, or use the [train-test-split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. Another extremely useful tool is [KFold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html). Read up on cross-validation and why it is such an effective way to evaluate your Machine Learning models.
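The sketch below shows one way KFold cross-validation and the Mean Squared Error function might be combined. The data is synthetic (a noisy straight line generated here purely for illustration), and the choice of 5 shuffled folds is an assumption, not a requirement of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration): a noisy linear trend
rng = np.random.default_rng(0)
X = np.arange(60, dtype=float).reshape(-1, 1)
y = 5.0 * X.ravel() + 10.0 + rng.normal(scale=2.0, size=60)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_idx, test_idx in kfold.split(X):
    # Fit on the training folds, evaluate on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("MSE per fold:", np.round(fold_mse, 2))
print("Mean MSE:", np.mean(fold_mse))
```

Because every observation is held out exactly once, the averaged error is usually a steadier estimate than the error from a single train-test split.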

For the purpose of this assignment, the final values should come from a regression **trained and tested on the entire dataset**. The following results will be looked at (print these values clearly!):
1. Regression Score 
2. Mean Squared Error (Expect this to be really high, since the values of the data-set are also high)
3. The Regression Line that you get from the linear-models (either from the coefficients or from the predictions) over the data-points (I will upload a sample on Piazza)
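As a hedged sketch of how the first two values and the line's coefficients could be printed, the example below fits and scores a model on the same synthetic data. The numbers are made up for illustration and are not related to the dam fund; the single feature stands in for something like days elapsed since the first record.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for a cumulative total (values are assumptions)
rng = np.random.default_rng(1)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0e6 * X.ravel() + 1.0e6 + rng.normal(scale=1.0e5, size=10)

# Train and evaluate on the entire dataset, as the assignment asks
model = LinearRegression().fit(X, y)
predictions = model.predict(X)

print("Regression score (R^2):", model.score(X, y))
print("Mean squared error:", mean_squared_error(y, predictions))
print("Line: y = %.1f * x + %.1f" % (model.coef_[0], model.intercept_))
```

The `predictions` array can be passed to `plt.plot` over a scatter of the data points to draw the regression line.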

**Lastly**, after you have trained your model, you need to build a mock data-set containing just the datetime objects. A good function to use is [pandas.date_range](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html), which lets you get a DatetimeIndex with whatever dates and frequency you choose (the frequency is an extremely important parameter). You can convert that DatetimeIndex to a DataFrame and then extract the same features as you did in the previous part (making a function for feature extraction is a good idea). After that you need to print the exact **Month and Year that the 1.5 Trillion Rs target will be reached according to your regression.**
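The projection step could look roughly like the sketch below. Everything in it is an assumption made for illustration: the training data is a synthetic cumulative total growing by 100 units per day, the target of 49,950 units is a small stand-in for the real 1.5 Trillion Rs, and the only feature is the number of days since the first date.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: the cumulative total grows by 100 units per day
start = pd.Timestamp("2018-07-06")
train_dates = pd.date_range(start, periods=30, freq="D")
X_train = (train_dates - start).days.to_numpy().reshape(-1, 1)
y_train = 100.0 * X_train.ravel()

model = LinearRegression().fit(X_train, y_train)

# Mock data-set of future dates, with the same feature extracted
target = 49_950.0  # hypothetical target, far smaller than the real one
future_dates = pd.date_range(start, periods=1000, freq="D")
X_future = (future_dates - start).days.to_numpy().reshape(-1, 1)
predicted = model.predict(X_future)

# First date on which the predicted cumulative total meets the target
reached = future_dates[predicted >= target]
if len(reached) > 0:
    print(reached[0].strftime("%B %Y"))
else:
    print("Target not reached within the mock date range")
```

If the target is not reached, the mock range is too short: increase `periods` or use a coarser `freq` such as monthly.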

**This is an iterative process and you will have to play around with the features, the model and parameters of the regression many times before you reach a good result**

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = pd.DataFrame(unique_data['Day_int'])
y = cumulative_sum_by_date['Amount']
# print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

reg_Day = LinearRegression()
reg_Day.fit(X_train, y_train)

print(reg_Day.score(X_train, y_train))

0.002107883904407193


#### What do you think the limitations of your regression were? What problems did you face in not being able to get a good fit?