## Challenge 3: Regression

## Create a regression model that predicts the average monthly spend of new customers.

### Challenge Instructions
### To complete this challenge:
1. Use the Adventure Works Cycles customer data you worked with in challenges 1 and 2 to create a regression model that predicts a customer's average monthly spend. The model should predict average monthly spend for new customers for whom no information about average monthly spend or previous bike purchases is available.

2. Download the test data. This is the same test data that you used in the classification challenge. It includes customer features but does not include bike purchasing or average monthly spend values.

3. Use your model to predict on the corresponding test dataset. Don't forget to apply what you've learned throughout this course.

4. Go to the next page to check how well your predictions match the actual results.

In [None]:
import numpy as np
import pandas as pd



In [None]:
df = pd.read_csv("FinalAdv.csv")
df.info()

In [None]:

df.shape

In [None]:
print(df.CustomerID.unique().shape)

In [None]:
print(df.CustomerID.shape)

In [None]:
df.CustomerID.duplicated().sum()

### Get a correlation matrix and plot it as a seaborn heatmap

In [None]:
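# Correlations are only defined for numeric columns, so select those explicitly
# (recent pandas versions raise an error if text columns are passed to corr()).
corr = df.select_dtypes(include='number').corr()
corr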

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig = plt.figure(figsize= (12,12))
sns.heatmap(corr, cbar = True,  square = True, cmap= 'coolwarm');

### The correlation matrix and heatmap show correlations between the label (AveMonthSpend) and the numerical features only.

* Features like CustomerID, FirstName, AddressLine1, and PhoneNumber do not influence the label AveMonthSpend. Remove them from the analysis.

* Regression algorithms require numerical features. The dataset contains categorical features such as CountryRegionName, Education, MaritalStatus, Occupation, and Gender, so these need to be encoded.

* Encode the categorical features by creating dummy variables.

In [None]:
df_cat = df[['CountryRegionName', 'Education', 'Occupation', 'Gender', 'MaritalStatus', 'AgeGroup']]
df_cat_encoded = pd.get_dummies(data = df_cat)

print(df_cat_encoded.info())
print("********\nShape = :", df_cat_encoded.shape)
(df_cat_encoded.head())

### The test dataset contains neither `BikeBuyer` nor `AveMonthSpend`, so do not use them as features when building the model

In [None]:
df_num = df[['HomeOwnerFlag', 'NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren', 'YearlyIncome']]
df_encoded = pd.concat([df_cat_encoded, df_num], axis = 1, join = 'outer')
df_encoded.head()

In [None]:
# Correlation matrix and heatmap for the encoded features.
# Note: df_encoded does not include the label 'AveMonthSpend', so this shows
# feature-to-feature correlations only. To see correlations with the label,
# the label column would have to be added to the frame first.
import seaborn as sns
import matplotlib.pyplot as plt

corr2 = df_encoded.corr()

fig = plt.figure(figsize=(14, 14))
sns.heatmap(corr2, cbar=True, square=True, cmap='coolwarm');

## Split the dataset

In [44]:
from sklearn.model_selection import train_test_split
import numpy.random as nr

nr.seed(9988)
labels = np.array(df['AveMonthSpend'])
features = np.array(df_encoded)
# Hold out 30% of the rows for evaluation; random_state makes the split reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.3, random_state=5)

In [45]:
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

(11482, 29) (11482,)
(4922, 29) (4922,)


## YearlyIncome has much larger values than the other features, so the features need to be scaled.

In [46]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)


X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
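
As a rough illustration of what `StandardScaler` does here, the sketch below standardizes a toy array by hand with NumPy. The mean and standard deviation come from the training rows only and are then reused for the test rows, which mirrors fitting the scaler on `X_train` and calling `transform` on both sets. The array values and the names `toy_train` and `toy_test` are made up for illustration.

```python
import numpy as np

# Toy data (hypothetical values): column 0 mimics an income-like feature,
# column 1 a small count such as the number of cars owned.
toy_train = np.array([[50000.0, 1.0],
                      [80000.0, 2.0],
                      [110000.0, 3.0]])
toy_test = np.array([[95000.0, 0.0]])

# Statistics are computed on the training rows only.
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population std (ddof=0), as used by StandardScaler

toy_train_scaled = (toy_train - mu) / sigma
toy_test_scaled = (toy_test - mu) / sigma  # reuse the training statistics

print(toy_train_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_train_scaled.std(axis=0))   # approximately 1 for each column
print(toy_test_scaled)
```

After scaling, both columns are on a comparable scale, so the income-like column no longer dominates purely because of its units.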


## Build/Fit/Train the model

In [47]:
from sklearn.linear_model import LinearRegression

LinReg = LinearRegression()
LinReg.fit(X_train, Y_train)

LinearRegression()
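
The split above keeps 30% of the rows in `X_test` and `Y_test`, which makes it possible to estimate the error before predicting on the unlabeled customers. The sketch below shows one way to do that with scikit-learn's metric functions. It runs on synthetic data (the features, coefficients, and noise level are invented), so the printed numbers say nothing about the actual Adventure Works model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer data (hypothetical):
# 3 features and a linear label with some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 70 + X @ np.array([5.0, -2.0, 10.0]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

model = LinearRegression().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Error metrics on the held-out rows.
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
mae = mean_absolute_error(y_te, y_pred)
r2 = r2_score(y_te, y_pred)
print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}  R^2: {r2:.3f}")
```

In this notebook, the same three metric calls would take `Y_test` and `LinReg.predict(X_test)` as arguments.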

In [48]:
print(LinReg.intercept_)
print(LinReg.coef_)

72.44618864071874
[-7.03625953e+12 -4.96421451e+12 -5.25386445e+12 -5.20126668e+12
 -5.40893796e+12 -8.73103905e+12  6.85960684e+13  5.73573036e+13
  5.74912813e+13  6.67648798e+13  4.25283577e+13 -2.39348487e+13
 -2.45181562e+13 -2.22570721e+13 -3.02407782e+13 -2.84251580e+13
 -5.84149861e+13 -5.84149861e+13  2.04892269e+13  2.04892269e+13
 -2.46511589e+12 -1.79723486e+12 -1.25456051e+12 -1.92116159e+12
  3.86505436e-02 -1.26992053e-02  1.65567563e+01  4.31690394e-03
  7.79043527e+00]
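
Coefficients on the order of 1e12 to 1e13 on the dummy columns, with identical values for the two `Gender` columns and for the two `MaritalStatus` columns, are what one would expect when the one-hot columns are perfectly collinear. Within each categorical feature the dummies sum to 1 in every row, which duplicates the intercept. The predictions can still be reasonable, but the individual coefficients are not interpretable. One possible remedy, sketched below on a toy frame with made-up values, is `drop_first=True` in `pd.get_dummies`. This is an optional variation, not what the notebook does; `design_rank` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two categorical features.
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "MaritalStatus": ["S", "M", "M", "S", "M", "S"],
})

def design_rank(dummies):
    """Return (rank, n_columns) of the design matrix [intercept | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy(dtype=float)])
    return int(np.linalg.matrix_rank(X)), X.shape[1]

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # drops one category per feature

print(design_rank(full))     # rank is lower than the column count: collinear
print(design_rank(reduced))  # rank equals the column count: full rank
```

If this variation were adopted, the same `drop_first=True` setting would have to be used when encoding the test data so that the columns match.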


## Read the test data from `AW_test.csv` and make predictions on it.

### Before predicting:
* Give the test data the same structure as the training data:
  - Create the `Age` column from the given date and the date of birth
  - Create the `AgeGroup` column
  - Keep only the features that were used to train the model
  - Encode the categorical features
  - Predict the label
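
One detail the checklist leaves implicit is that `pd.get_dummies` only creates columns for categories present in the frame it is given. The encoded test frame can therefore end up with fewer columns, or a different column order, than the training frame. A minimal sketch of aligning the two with `reindex`, using toy frames whose values are invented for illustration:

```python
import pandas as pd

# Toy training and test frames (hypothetical values).
toy_train = pd.DataFrame({"Occupation": ["Manual", "Clerical", "Professional"],
                          "YearlyIncome": [30000, 45000, 90000]})
toy_test = pd.DataFrame({"Occupation": ["Professional", "Manual"],
                         "YearlyIncome": [85000, 28000]})

train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)  # has no 'Occupation_Clerical' column

# Give the test frame exactly the training columns, in the same order;
# categories missing from the test data become all-zero columns.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

Any category that appears only in the test data is dropped by this step, which may or may not be acceptable depending on the data.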

In [49]:
test_df = pd.read_csv('AW_test.csv')
print(test_df.shape)
test_df.head()
test_df.info()

(500, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   CustomerID            500 non-null    int64 
 1   Title                 4 non-null      object
 2   FirstName             500 non-null    object
 3   MiddleName            284 non-null    object
 4   LastName              500 non-null    object
 5   Suffix                1 non-null      object
 6   AddressLine1          500 non-null    object
 7   AddressLine2          13 non-null     object
 8   City                  500 non-null    object
 9   StateProvinceName     500 non-null    object
 10  CountryRegionName     500 non-null    object
 11  PostalCode            500 non-null    object
 12  PhoneNumber           500 non-null    object
 13  BirthDate             500 non-null    object
 14  Education             500 non-null    object
 15  Occupation            500 non-

In [50]:
test_df['CurrDate'] = pd.to_datetime('1998-01-01')

test_df['BirthDate'] = pd.to_datetime(test_df['BirthDate'])

test_df['Age'] = (((test_df['CurrDate'] - test_df['BirthDate']).dt.days)/365).astype(int) # gives age in years

test_df

test_df['AgeGroup'] = '' 

test_df.loc[(test_df['Age']<25),'AgeGroup']='Under 25'
test_df.loc[(test_df['Age']>=25) & (test_df['Age']<=45), 'AgeGroup']='Between 25 and 45'
test_df.loc[(test_df['Age']>45) & (test_df['Age']<=55), 'AgeGroup']='Between 45 and 55'
test_df.loc[(test_df['Age']>55), 'AgeGroup']='Over 55'

test_df

Unnamed: 0,CustomerID,Title,FirstName,MiddleName,LastName,Suffix,AddressLine1,AddressLine2,City,StateProvinceName,...,Gender,MaritalStatus,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome,CurrDate,Age,AgeGroup
0,18988,,Courtney,A,Baker,,8727 Buena Vista Ave.,,Fremont,California,...,F,S,0,2,0,5,86931,1998-01-01,53,Between 45 and 55
1,29135,,Adam,C,Allen,,3491 Cook Street,,Haney,British Columbia,...,M,M,1,2,2,4,100125,1998-01-01,33,Between 25 and 45
2,12156,,Bonnie,,Raji,,359 Pleasant Hill Rd,,Burbank,California,...,F,M,1,2,0,4,103985,1998-01-01,64,Over 55
3,13749,,Julio,C,Alonso,,8945 Euclid Ave.,,Burlingame,California,...,M,M,1,0,0,4,127161,1998-01-01,39,Between 25 and 45
4,27780,,Christy,A,Andersen,,"42, boulevard Tremblay",,Dunkerque,Nord,...,F,M,1,1,2,2,21876,1998-01-01,32,Between 25 and 45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,24211,,Sharon,A,Shan,,5850 Westwood Dr,,Peterborough,England,...,F,S,0,0,0,0,97084,1998-01-01,28,Between 25 and 45
496,23627,,Adrienne,,Navarro,,Buergermeister-ulrich-str 900,Einkaufsabteilung,Erlangen,Bayern,...,F,S,0,4,4,4,110762,1998-01-01,47,Between 45 and 55
497,14500,,Jasmine,C,Ward,,1707 Willowwood Ct.,,Torrance,California,...,F,S,0,4,3,3,138097,1998-01-01,60,Over 55
498,22223,,Gabrielle,,Parker,,6857 Medina Drive,,Mill Valley,California,...,F,M,1,1,0,2,101465,1998-01-01,40,Between 25 and 45
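
The four `.loc` assignments in the cell above can also be written as a single binning step. The following is a sketch with `pd.cut`, assuming integer ages and the same boundaries as in that cell; the `ages` series is invented for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([24, 25, 45, 46, 55, 56])  # hypothetical integer ages

# Right-closed bins: (-inf, 24], (24, 45], (45, 55], (55, inf)
age_group = pd.cut(
    ages,
    bins=[-np.inf, 24, 45, 55, np.inf],
    labels=["Under 25", "Between 25 and 45", "Between 45 and 55", "Over 55"],
)

print(age_group.tolist())
```

The result is a categorical column; `.astype(str)` converts it to plain strings if the rest of the pipeline expects them.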


In [51]:
test_df_cat = test_df[['CountryRegionName', 'Education', 'Occupation', 'Gender', 'MaritalStatus', 'AgeGroup']]
test_df_cat = pd.get_dummies(test_df_cat)
test_df_cat.head()

Unnamed: 0,CountryRegionName_Australia,CountryRegionName_Canada,CountryRegionName_France,CountryRegionName_Germany,CountryRegionName_United Kingdom,CountryRegionName_United States,Education_Bachelors,Education_Graduate Degree,Education_High School,Education_Partial College,...,Occupation_Professional,Occupation_Skilled Manual,Gender_F,Gender_M,MaritalStatus_M,MaritalStatus_S,AgeGroup_Between 25 and 45,AgeGroup_Between 45 and 55,AgeGroup_Over 55,AgeGroup_Under 25
0,0,0,0,0,0,1,1,0,0,0,...,0,0,1,0,0,1,0,1,0,0
1,0,1,0,0,0,0,1,0,0,0,...,0,1,0,1,1,0,1,0,0,0
2,0,0,0,0,0,1,0,1,0,0,...,0,0,1,0,1,0,0,0,1,0
3,0,0,0,0,0,1,0,1,0,0,...,0,1,0,1,1,0,1,0,0,0
4,0,0,1,0,0,0,0,0,1,0,...,0,0,1,0,1,0,1,0,0,0


In [52]:
test_df_num = test_df[['HomeOwnerFlag', 'NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren', 'YearlyIncome']]

test_df_encoded = pd.concat([test_df_cat, test_df_num], axis = 1, join = 'outer')
test_df_encoded.head()

Unnamed: 0,CountryRegionName_Australia,CountryRegionName_Canada,CountryRegionName_France,CountryRegionName_Germany,CountryRegionName_United Kingdom,CountryRegionName_United States,Education_Bachelors,Education_Graduate Degree,Education_High School,Education_Partial College,...,MaritalStatus_S,AgeGroup_Between 25 and 45,AgeGroup_Between 45 and 55,AgeGroup_Over 55,AgeGroup_Under 25,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome
0,0,0,0,0,0,1,1,0,0,0,...,1,0,1,0,0,0,2,0,5,86931
1,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,0,1,2,2,4,100125
2,0,0,0,0,0,1,0,1,0,0,...,0,0,0,1,0,1,2,0,4,103985
3,0,0,0,0,0,1,0,1,0,0,...,0,1,0,0,0,1,0,0,4,127161
4,0,0,1,0,0,0,0,0,1,0,...,0,1,0,0,0,1,1,2,2,21876


In [53]:
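# Align the test columns with the training columns (same names, same order) before scaling.
test_df_encoded = test_df_encoded.reindex(columns=df_encoded.columns, fill_value=0)
test_features = scaler.transform(np.array(test_df_encoded))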



In [54]:
test_predictions = LinReg.predict(test_features)
print(test_predictions)

[ 43.48331741 106.75637222  46.75878261  88.62117768  61.2223514
  45.18458165  96.41441997 128.03309185 103.53054023  55.31931514
  60.89376404  52.68950105  74.92164056  46.31071795  36.81705877
  52.38684689  85.97795743  74.86528907 110.92894637  60.89045029
  68.02729419  76.96532788 146.96737523  85.69170567  53.83838498
  72.84286524  87.1322246  117.57498736  78.92028396  63.56038632
  68.9637025   82.12851225  37.5735727   72.9257872  107.11232006
 102.53929392 149.22028462  92.62647389  57.32532744  87.58225522
  47.32413766  82.01907291  81.32717536  49.12626989  57.01882179
  77.64757397  62.00532073  86.8288718  117.63761301  80.73361785
  79.36466189  95.88092982  82.62443881  67.84464449  48.90747954
  77.28349412  57.96664241  74.86437955  65.5316476   69.12994616
  45.89550535  68.04182844  91.37180498  83.9156408   45.16707402
  81.21712686  82.93910601 133.95667806  65.42570656 107.56247747
  85.98548616  67.84381137  93.35314292  46.10080654  67.49664706
  83.217473

In [55]:
# Pair each CustomerID with its predicted AveMonthSpend (rows are in the same order).
# The unassigned reset_index() calls had no effect and were removed.
test_predictions = pd.DataFrame(data=test_predictions)
cid = test_df[['CustomerID']]
predictions_with_cid = pd.merge(cid, test_predictions, left_index=True, right_index=True)
predictions_with_cid

Unnamed: 0,CustomerID,0
0,18988,43.483317
1,29135,106.756372
2,12156,46.758783
3,13749,88.621178
4,27780,61.222351
...,...,...
495,24211,48.316973
496,23627,92.069670
497,14500,81.035071
498,22223,56.711722


In [57]:
predictions_with_cid.to_csv("LinRegression_Pred.csv",index=False)