# Challenge 2: Classification
# Activity 2: Build a classification model to predict customer purchasing behavior.

You have explored and analyzed customer data collected by the Adventure Works Cycles company. Now you should be ready to apply what you have learned about the data to building, testing, and optimizing a predictive machine learning model.

Specifically, you must use any combination of Azure Machine Learning, R, or Python to create a classification model using scikit-learn that predicts whether or not a new customer will buy a bike.

### Challenge Instructions
To complete this challenge:

#### 1. Use the Adventure Works Cycles customer data you worked with in challenge 1 to create a classification model that predicts whether or not a customer will purchase a bike. The model should predict bike purchasing for new customers for whom no information about average monthly spend or previous bike purchases is available.
#### 2. Download the Customer test data. This data includes customer features but does not include bike purchasing or average monthly spend values.
#### 3. Use your model to predict bike purchasing for the customers in the test dataset. Don't forget to apply what you've learned throughout this course.
#### 4. Go to the next page to check how well your predictions match the actual results.

### This is a classification problem: predict whether or not a new customer will purchase a bike. Logistic regression is used as the classifier.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

In [4]:
df = pd.read_csv("FinalAdv.csv")
df.head()

Unnamed: 0,CustomerID,AveMonthSpend,BikeBuyer,FirstName,LastName,AddressLine1,City,StateProvinceName,CountryRegionName,PostalCode,...,Gender,MaritalStatus,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome,CurrDate,Age,AgeGroup
0,11000,89,0,Jon,Yang,3761 N. 14th St,Rockhampton,Queensland,Australia,4700,...,M,M,1,0,0,2,137947,1998-01-01,31.756164,Between 25 and 45
1,11001,117,1,Eugene,Huang,2243 W St.,Seaford,Victoria,Australia,3198,...,M,S,0,1,3,3,101141,1998-01-01,32.657534,Between 25 and 45
2,11002,123,0,Ruben,Torres,5844 Linden Land,Hobart,Tasmania,Australia,7001,...,M,M,1,1,3,3,91945,1998-01-01,32.410959,Between 25 and 45
3,11003,50,0,Christy,Zhu,1825 Village Pl.,North Ryde,New South Wales,Australia,2113,...,F,S,0,1,0,0,86688,1998-01-01,29.89863,Between 25 and 45
4,11004,95,1,Elizabeth,Johnson,7553 Harness Circle,Wollongong,New South Wales,Australia,2500,...,F,S,1,4,5,5,92771,1998-01-01,29.419178,Between 25 and 45


In [5]:
#df.info()
print("Shape =", df.shape)
print(df.CustomerID.unique().shape)

Shape = (16404, 24)
(16404,)


### The number of rows in the data frame equals the number of unique Customer IDs

In [6]:
df.CustomerID.duplicated().sum()

0

### No duplicate values in CustomerID

In [None]:
df.isnull().sum() # Find missing values. 

In [None]:
df.isna().sum() # isnull() is an alias of isna(), so this repeats the check above

In [7]:
import sklearn.metrics as sklm
from sklearn import linear_model

In [8]:
df[['AveMonthSpend', 'BikeBuyer']].groupby('BikeBuyer').count()

Unnamed: 0_level_0,AveMonthSpend
BikeBuyer,Unnamed: 1_level_1
0,10949
1,5455
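
The counts above show about twice as many non-buyers as buyers, which matters when judging a classifier: a model that always predicts 0 would already be right about two thirds of the time. A minimal sketch of that baseline, with the two counts copied by hand from the output above (the names `counts`, `proportions`, and `baseline_accuracy` are illustrative, not from the original notebook):

```python
import pandas as pd

# Class counts copied from the groupby output above (0 = non-buyer, 1 = buyer)
counts = pd.Series({0: 10949, 1: 5455}, name="BikeBuyer")

# Share of each class in the training data
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = proportions.max()

print(proportions.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported for the logistic regression model is worth comparing against this baseline.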


In [None]:
df.groupby('BikeBuyer').YearlyIncome.count()

In [None]:
df.BikeBuyer.nunique()

In [None]:
df.info()

In [None]:
df.nunique() # No of unique values for each feature

* Some features, such as CustomerID, FirstName, AddressLine1, and PhoneNumber, have many unique values and are not expected to influence the label BikeBuyer. Remove them from the analysis.

* For the categorical features, encode them by creating dummy variables.
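
The two points above can be sketched on a toy frame. This is only an illustration under assumptions: the data is made up, and the 0.9 uniqueness threshold is a hypothetical rule of thumb rather than something the notebook uses (the notebook selects its columns by hand).

```python
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "FirstName": ["Jon", "Eugene", "Ruben", "Christy"],
    "Gender": ["M", "M", "M", "F"],
    "MaritalStatus": ["M", "S", "M", "S"],
})

# Hypothetical rule: a column is identifier-like when more than 90% of its values are unique
unique_ratio = toy.nunique() / len(toy)
id_like = unique_ratio[unique_ratio > 0.9].index.tolist()

# Drop the identifier-like columns, then one-hot encode the remaining categorical ones
toy_encoded = pd.get_dummies(toy.drop(columns=id_like))

print(id_like)
print(toy_encoded.columns.tolist())
```

A ratio like this would also flag continuous numeric columns such as YearlyIncome, so it is best treated as a prompt for manual review rather than an automatic filter.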

In [None]:
df_cat = df[['CountryRegionName', 'Education', 'Occupation', 'Gender', 'MaritalStatus', 'AgeGroup']]
df_cat_encoded = pd.get_dummies(data = df_cat)
df_cat_encoded.head()

In [None]:
df_cat_encoded.shape

In [None]:
df_num = df[['HomeOwnerFlag', 'NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren', 'YearlyIncome']]
df_encoded = pd.concat([df_cat_encoded, df_num], axis=1, join= 'outer')
df_encoded.head()

In [None]:
df_encoded.shape

In [None]:
import seaborn as sns
corr = df_encoded.corr()

sns.heatmap(corr, cbar = True,  square = True, cmap= 'coolwarm')

In [None]:
from sklearn.model_selection import train_test_split

labels = np.array(df['BikeBuyer'])

# random_state fixes the shuffle, so the 70/30 split is reproducible
X_train, X_test, Y_train, Y_test = train_test_split(df_encoded, labels, test_size=0.3, random_state=5)


In [None]:
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

## Income values are much larger than the other features, so the features need to be scaled.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)

X_train = scaler.transform(X_train)
X_test= scaler.transform(X_test)
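
The scaler is fitted on the training split only and then reused for the held-out rows, which keeps information from the test split out of the scaling statistics. One way to make that harder to get wrong is to bundle the scaler and the model in a scikit-learn `Pipeline`. The sketch below uses synthetic data, not the Adventure Works features, and is an alternative arrangement rather than what this notebook does:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one large-scale column (like YearlyIncome) and one 0/1 flag
X_demo = np.column_stack([rng.normal(80_000, 30_000, 200), rng.integers(0, 2, 200)])
y_demo = (X_demo[:, 0] > 80_000).astype(int)

# fit() learns the scaling statistics and the model from the first 150 rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_demo[:150], y_demo[:150])

# predict() applies the same fitted scaler before the classifier
demo_pred = pipe.predict(X_demo[150:])

# After scaling, each training column has mean ~0 and standard deviation ~1
scaled = pipe.named_steps["standardscaler"].transform(X_demo[:150])
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```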

In [None]:
X_train

## Build/Fit/Train the model

In [None]:
LogReg = linear_model.LogisticRegression()
LogReg.fit(X_train, Y_train)


In [None]:
print("Intercept= ", LogReg.intercept_)
print("CoEff = ", LogReg.coef_)

In [None]:
# Class probabilities: column 0 is P(BikeBuyer = 0), column 1 is P(BikeBuyer = 1)
y_prob = LogReg.predict_proba(X_test)
y_prob[:15,:]
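
For a two-class logistic regression, the class-1 probability returned by `predict_proba` is the logistic (sigmoid) function applied to the linear score built from `coef_` and `intercept_`. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `model` are illustrative names, not objects from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic two-class problem with three features
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = X.w + b, then the sigmoid gives P(class 1)
z = X_demo @ model.coef_.ravel() + model.intercept_[0]
manual_p1 = 1.0 / (1.0 + np.exp(-z))

proba = model.predict_proba(X_demo)

# Check that the manual calculation agrees with predict_proba
print(np.allclose(manual_p1, proba[:, 1]))

# Each row of predict_proba sums to 1 across the two classes
print(np.allclose(proba.sum(axis=1), 1.0))
```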

In [None]:
y_pred = LogReg.predict(X_test)
print(len(y_pred))
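
`sklearn.metrics` is imported as `sklm` earlier, but no evaluation of the held-out predictions is shown. A minimal sketch of how the predictions could be scored is given below. The label arrays are made up for illustration; in the notebook they would be replaced by `Y_test` and `y_pred`.

```python
import numpy as np
import sklearn.metrics as sklm

# Hypothetical true labels and predictions, standing in for Y_test and y_pred
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
conf = sklm.confusion_matrix(y_true_demo, y_pred_demo)
accuracy = sklm.accuracy_score(y_true_demo, y_pred_demo)

# Precision, recall and F1 for the positive class (BikeBuyer = 1)
precision, recall, f1, _ = sklm.precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average="binary"
)

print(conf)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because buyers are the minority class, precision and recall for class 1 are usually more informative than accuracy alone.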

## Read the test data from `AW_test.csv` and make predictions on it.

### Before predicting:
* Give the test data the same structure as the training data:
  - create the `Age` column from the reference date and the date of birth
  - create the `AgeGroup` column
  - keep only the features that were used to train the model
  - encode the categorical features
* Then predict the label.
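
One detail worth guarding against when encoding the two files separately: `pd.get_dummies` only creates columns for the categories it actually sees, so a category missing from the test file would leave the test frame with fewer columns than the scaler and model were fitted on. A minimal sketch of one way to align them, using made-up rows (`train_raw`, `test_raw`, and `test_aligned` are illustrative names):

```python
import pandas as pd

# Made-up rows: the test set happens to contain no "Manual" occupation
train_raw = pd.DataFrame({
    "Occupation": ["Clerical", "Manual", "Professional"],
    "YearlyIncome": [50000, 30000, 90000],
})
test_raw = pd.DataFrame({
    "Occupation": ["Professional", "Clerical"],
    "YearlyIncome": [85000, 42000],
})

train_enc = pd.get_dummies(train_raw)
test_enc = pd.get_dummies(test_raw)

# Reindex to the training columns: missing dummies are added as 0,
# unseen extras are dropped, and the column order matches the training frame
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc.columns.tolist())
print(test_aligned.columns.tolist())
```

In this notebook the equivalent call would reindex the encoded test frame to the columns of `df_encoded`, assuming that frame is still in memory.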

In [10]:
test_df = pd.read_csv('AW_test.csv')
print(test_df.shape)
test_df.head()
test_df.info()

(500, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   CustomerID            500 non-null    int64 
 1   Title                 4 non-null      object
 2   FirstName             500 non-null    object
 3   MiddleName            284 non-null    object
 4   LastName              500 non-null    object
 5   Suffix                1 non-null      object
 6   AddressLine1          500 non-null    object
 7   AddressLine2          13 non-null     object
 8   City                  500 non-null    object
 9   StateProvinceName     500 non-null    object
 10  CountryRegionName     500 non-null    object
 11  PostalCode            500 non-null    object
 12  PhoneNumber           500 non-null    object
 13  BirthDate             500 non-null    object
 14  Education             500 non-null    object
 15  Occupation            500 non-

In [None]:
# Reference date for the age calculation (the same CurrDate that appears in the training data)
test_df['CurrDate'] = pd.to_datetime('1998-01-01')
test_df['BirthDate'] = pd.to_datetime(test_df['BirthDate'])

# Age in whole years; dividing by 365 ignores leap days
test_df['Age'] = ((test_df['CurrDate'] - test_df['BirthDate']).dt.days / 365).astype(int)

# Bin Age into the four AgeGroup categories
test_df['AgeGroup'] = ''
test_df.loc[test_df['Age'] < 25, 'AgeGroup'] = 'Under 25'
test_df.loc[(test_df['Age'] >= 25) & (test_df['Age'] <= 45), 'AgeGroup'] = 'Between 25 and 45'
test_df.loc[(test_df['Age'] > 45) & (test_df['Age'] <= 55), 'AgeGroup'] = 'Between 45 and 55'
test_df.loc[test_df['Age'] > 55, 'AgeGroup'] = 'Over 55'

test_df
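
The same binning can be written more compactly with `pd.cut`. The sketch below is an alternative under the assumption that ages are whole numbers, as they are after the `astype(int)` step; the bin edges would need adjusting for fractional ages. The sample ages are made up to exercise each boundary.

```python
import numpy as np
import pandas as pd

# Made-up whole-number ages covering every boundary
ages = pd.Series([18, 24, 25, 45, 46, 55, 56, 80])

# Right-closed bins: (-inf, 24], (24, 45], (45, 55], (55, inf)
bins = [-np.inf, 24, 45, 55, np.inf]
labels = ["Under 25", "Between 25 and 45", "Between 45 and 55", "Over 55"]

# Convert from categorical to plain strings to match the column built above
age_group = pd.cut(ages, bins=bins, labels=labels).astype(str)

print(age_group.tolist())
```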

In [31]:
test_df_cat = test_df[['CountryRegionName', 'Education', 'Occupation', 'Gender', 'MaritalStatus', 'AgeGroup']]
test_df_cat = pd.get_dummies(test_df_cat)
test_df_cat.head()

Unnamed: 0,CountryRegionName_Australia,CountryRegionName_Canada,CountryRegionName_France,CountryRegionName_Germany,CountryRegionName_United Kingdom,CountryRegionName_United States,Education_Bachelors,Education_Graduate Degree,Education_High School,Education_Partial College,...,Occupation_Professional,Occupation_Skilled Manual,Gender_F,Gender_M,MaritalStatus_M,MaritalStatus_S,AgeGroup_Between 25 and 45,AgeGroup_Between 45 and 55,AgeGroup_Over 55,AgeGroup_Under 25
0,0,0,0,0,0,1,1,0,0,0,...,0,0,1,0,0,1,0,1,0,0
1,0,1,0,0,0,0,1,0,0,0,...,0,1,0,1,1,0,1,0,0,0
2,0,0,0,0,0,1,0,1,0,0,...,0,0,1,0,1,0,0,0,1,0
3,0,0,0,0,0,1,0,1,0,0,...,0,1,0,1,1,0,1,0,0,0
4,0,0,1,0,0,0,0,0,1,0,...,0,0,1,0,1,0,1,0,0,0


In [32]:
test_df_num = test_df[['HomeOwnerFlag', 'NumberCarsOwned', 'NumberChildrenAtHome', 'TotalChildren', 'YearlyIncome']]
test_df_num

Unnamed: 0,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome
0,0,2,0,5,86931
1,1,2,2,4,100125
2,1,2,0,4,103985
3,1,0,0,4,127161
4,1,1,2,2,21876
...,...,...,...,...,...
495,0,0,0,0,97084
496,0,4,4,4,110762
497,0,4,3,3,138097
498,1,1,0,2,101465


In [34]:
test_df_encoded = pd.concat([test_df_cat, test_df_num], axis = 1, join = 'outer')
test_df_encoded.head()

Unnamed: 0,CountryRegionName_Australia,CountryRegionName_Canada,CountryRegionName_France,CountryRegionName_Germany,CountryRegionName_United Kingdom,CountryRegionName_United States,Education_Bachelors,Education_Graduate Degree,Education_High School,Education_Partial College,...,MaritalStatus_S,AgeGroup_Between 25 and 45,AgeGroup_Between 45 and 55,AgeGroup_Over 55,AgeGroup_Under 25,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome
0,0,0,0,0,0,1,1,0,0,0,...,1,0,1,0,0,0,2,0,5,86931
1,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,0,1,2,2,4,100125
2,0,0,0,0,0,1,0,1,0,0,...,0,0,0,1,0,1,2,0,4,103985
3,0,0,0,0,0,1,0,1,0,0,...,0,1,0,0,0,1,0,0,4,127161
4,0,0,1,0,0,0,0,0,1,0,...,0,1,0,0,0,1,1,2,2,21876


In [39]:
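# Reuse the scaler fitted on the training data; the columns must match the training columns
test_features = scaler.transform(test_df_encoded)
print(type(test_features))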

<class 'numpy.ndarray'>


In [40]:
test_predictions = LogReg.predict(test_features)
print(test_predictions)

[0 1 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 0 1 1 1
 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0
 0 1 1 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1
 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0
 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0
 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0
 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 1
 0 0 0 0 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 1
 0 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0
 1 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 1 0 1
 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 1 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1
 0 1 1 0 0 1 0 0 0 1 1 1 

In [50]:
# Pair each CustomerID with its predicted label; both frames share the default 0..499 index
test_predictions = pd.DataFrame(data=test_predictions)
cid = pd.DataFrame(data=test_df['CustomerID'])
predictions_with_cid = pd.merge(cid, test_predictions, left_index=True, right_index=True)
len(predictions_with_cid)

500

In [54]:
predictions_with_cid.to_csv("zz_Class_Pred.csv",index=False)