## Linear Regression on E-commerce Customer Data

We fit a linear regression model to e-commerce customer data and use it to predict the yearly amount spent by each customer.

In [None]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
customers = pd.read_csv('../input/ecommerce-customers/Ecommerce Customers.csv')

In [None]:
customers.info()

info() tells us that there are 8 columns and 500 rows. Let us peek into the data using head().

In [None]:
customers.head()

We use a pairplot to see whether any of the columns are correlated with the yearly amount spent.

In [None]:
sns.pairplot(customers)

From the pair plots, we can see that the distributions look roughly normal, and that there is a clear correlation between length of membership and yearly amount spent.<br>
Let us find out more using a heatmap.

In [None]:
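# Restrict to numeric columns: recent pandas versions raise an error if corr() sees text columns
sns.heatmap(customers.select_dtypes(include='number').corr(), linewidths=0.5, annot=True)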

The above heatmap confirms the correlation between 'Length of Membership' and 'Yearly Amount Spent'. We can also see a good degree of correlation between 'Yearly Amount Spent' and 'Time on App', and a weaker correlation with 'Avg. Session Length'.
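
The ranking read off the heatmap can also be obtained as a sorted list. The following is a minimal sketch on a synthetic stand-in table: the values, the `demo` frame, and the coefficients used to generate it are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customers table (all values are invented for illustration)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Time on App': rng.normal(12, 1, n),
    'Length of Membership': rng.normal(3.5, 1, n),
    'Email': ['user{}@example.com'.format(i) for i in range(n)],  # non-numeric column
})
demo['Yearly Amount Spent'] = (
    40 * demo['Time on App']
    + 60 * demo['Length of Membership']
    + rng.normal(0, 10, n)
)

# Keep numeric columns only, then rank the features by correlation with the target
target_corr = (
    demo.select_dtypes(include='number')
        .corr()['Yearly Amount Spent']
        .drop('Yearly Amount Spent')
        .sort_values(ascending=False)
)
print(target_corr)
```

On the real data, the same chain applied to `customers` should list the candidate features in the order discussed above, assuming the column names match those used in this notebook.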

In [None]:
x = customers[['Time on App', 'Length of Membership']]
y = customers['Yearly Amount Spent']

For the time being, let's skip the 'Avg. Session Length' column since it has a weaker correlation. We shall include it later and see whether it yields considerably better results.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 50)

We split the dataset into train and test sets, with 30% as test data and 70% as train data.
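
As a quick sanity check on what this split produces, here is a small sketch using placeholder arrays of the same length as the dataset (500 rows). The names `X_demo` and `y_demo` are hypothetical and the values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 500 rows and 2 feature columns (arbitrary values)
X_demo = np.arange(1000).reshape(500, 2)
y_demo = np.arange(500)

a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=50
)
print(a_train.shape, a_test.shape)  # row counts of the train and test parts

# Reusing the same random_state reproduces the same split
a_train2, a_test2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=50
)
same_split = np.array_equal(a_test, a_test2)
print(same_split)
```

Fixing `random_state` matters later in the notebook: it lets the two-feature and three-feature models be evaluated on the same test rows.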

In [None]:
lm = LinearRegression()
lm.fit(x_train, y_train)

In [None]:
# Function to plot a learning curve for the given estimator
def plot_lc(estimator, x, y, train_sizes):
    # Use the estimator passed in as an argument rather than the global lm
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, x, y, train_sizes=train_sizes, cv=5,
        scoring='neg_mean_squared_error')
    train_scores_mean = np.mean(-train_scores, axis=1)
    train_scores_std = np.std(-train_scores, axis=1)
    test_scores_mean = np.mean(-test_scores, axis=1)
    test_scores_std = np.std(-test_scores, axis=1)

    sns.set()  # seaborn styling; the 'seaborn' style name is deprecated since matplotlib 3.6
    plt.plot(train_sizes, train_scores_mean, label = 'Training error')
    plt.plot(train_sizes, test_scores_mean, label = 'Validation error')
    plt.ylabel('MSE', fontsize = 14)
    plt.xlabel('Training set size', fontsize = 14)
    plt.title('Learning curve', fontsize = 18, y = 1.03)
    plt.legend()

In [None]:
print("Coeffs are Time on App : {0} , Length of Membership: {1}".format(lm.coef_[0], lm.coef_[1]))
print("Intercept : ",lm.intercept_)
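
The coefficients and intercept fully describe the fitted model: a prediction is the intercept plus each feature multiplied by its coefficient. The sketch below checks this on synthetic data; `demo_lm`, `X_demo`, `y_demo`, and the generating values 500, 38, and 61 are assumptions chosen for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear rule plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 500 + 38 * X_demo[:, 0] + 61 * X_demo[:, 1] + rng.normal(0, 5, 100)

demo_lm = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: intercept + sum over features of (coefficient * feature value)
manual = demo_lm.intercept_ + X_demo @ demo_lm.coef_
matches = np.allclose(manual, demo_lm.predict(X_demo))
print(matches)
print(demo_lm.coef_, demo_lm.intercept_)
```

Read this way, each coefficient is the expected change in the target for a one-unit increase in that feature, with the other features held fixed.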

In [None]:
result = lm.predict(x_test)

In [None]:
plt.scatter(y_test, result)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")

In [None]:
plot_lc(lm,x,y,np.linspace(5, len(x_train), 10, dtype='int'))

The learning curve for the linear regression model shows a small gap between training and validation error, which indicates that the model's variance is low, i.e. it is not overfitting.
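
The gap can also be inspected numerically instead of by eye. This is a minimal sketch on synthetic data (the arrays and generating coefficients are invented), using the same `learning_curve` call and scoring as `plot_lc` but with fractional training sizes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression problem with 500 rows (values are illustrative only)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 2))
y_demo = 500 + 38 * X_demo[:, 0] + 61 * X_demo[:, 1] + rng.normal(0, 10, 500)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring='neg_mean_squared_error',
)

# Scores are negated MSE, so flip the sign before averaging over the folds
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
gap = val_mse - train_mse

for s, t, v, g in zip(sizes, train_mse, val_mse, gap):
    print("n={:>3}  train MSE={:8.2f}  validation MSE={:8.2f}  gap={:7.2f}".format(s, t, v, g))
```

A gap that stays small relative to the validation error as the training set grows is the pattern described above. With `cv=5`, each fold trains on at most 80% of the rows, which is why the largest size is 400 rather than 500.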

In [None]:
print('R2 score : ',metrics.r2_score(y_test, result))
print('Explained variance: ',metrics.explained_variance_score(y_test,result))
print('MSE: ', metrics.mean_squared_error(y_test,result))

The predicted and actual values agree reasonably well, and the R2 score of ~0.88 seems good enough. However, the MSE still seems high.
Let us add the column 'Avg. Session Length' this time and check whether the results improve (that is, whether the R2 score increases and the MSE decreases).
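
When judging whether an MSE is "high", it helps to remember that MSE is in squared units of the target; its square root (RMSE) is in the same units as the yearly amount spent. The sketch below computes MSE, RMSE, and R2 by hand on four made-up values and compares them with scikit-learn; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn import metrics

# Four invented actual/predicted pairs, purely for illustration
y_true = np.array([400.0, 500.0, 600.0, 550.0])
y_pred = np.array([410.0, 490.0, 590.0, 565.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)          # mean squared error
rmse = np.sqrt(mse)                    # same units as the target
ss_res = np.sum(residuals ** 2)        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot               # coefficient of determination

print("MSE:", mse, "RMSE:", rmse, "R2:", r2)
print("sklearn MSE:", metrics.mean_squared_error(y_true, y_pred))
print("sklearn R2:", metrics.r2_score(y_true, y_pred))
```

Applied to `y_test` and `result`, the RMSE gives a typical prediction error that can be compared directly against the scale of 'Yearly Amount Spent'.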

In [None]:
x = customers[['Time on App', 'Length of Membership','Avg. Session Length']]

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 50)

As before, we split the dataset into train and test sets, with 30% as test data and 70% as train data, using the same random_state.

In [None]:
lm.fit(x_train, y_train)

In [None]:
print("Coeffs are Time on App : {0} , Length of Membership: {1} , Avg. Session Length: {2}".format(lm.coef_[0], lm.coef_[1], lm.coef_[2]))
print("Intercept : ",lm.intercept_)

In [None]:
result = lm.predict(x_test)

In [None]:
plt.scatter(y_test, result)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")

This time, the predicted vs. actual values follow a tighter, more linear pattern, which is better. Let's look further into the R2 score and MSE.

In [None]:
plot_lc(lm,x,y,np.linspace(5, len(x_train), 10, dtype='int'))

The learning curve for the linear regression model shows that the gap between training and validation error has narrowed, meaning that variance is further reduced.

In [None]:
print('R2 score : ',metrics.r2_score(y_test, result))
print('Explained variance: ',metrics.explained_variance_score(y_test,result))
print('MSE: ', metrics.mean_squared_error(y_test,result))

Adding the column 'Avg. Session Length' has greatly improved the model: the R2 score increased to 0.981 and the MSE dropped to 118.68.