# **Imports**

Import pandas, numpy, matplotlib and seaborn

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline
sns.set_style('whitegrid')

# **Get the Data**

The company's Ecommerce Customers CSV file contains customer information such as Email, Address, and the color of each customer's Avatar. It also has the following numerical columns:

* Avg. Session Length: Average length of in-store style advice sessions.
* Time on App: Average time spent on the App, in minutes.
* Time on Website: Average time spent on the Website, in minutes.
* Length of Membership: How many years the customer has been a member.

In [1]:
customers = pd.read_csv('../input/ecommerce-customers/Ecommerce Customers.csv')

**Check the head of customers, then check its info() and summary statistics.**

In [1]:
customers.head()

In [1]:
customers.info()

In [1]:
customers.shape

In [1]:
customers.describe().transpose()

# **Exploratory Data Analysis**

**Use seaborn to create jointplots comparing Time on Website and Time on App with Yearly Amount Spent.**

In [1]:
sns.jointplot(x='Time on Website',y='Yearly Amount Spent',data=customers,color='grey')

In [1]:
sns.jointplot(x='Time on App',y='Yearly Amount Spent',data=customers,color='grey')

In [1]:
sns.jointplot(x='Time on App',y='Length of Membership',data=customers,color='grey',kind='hex')

**Explore the relationships across the entire dataset using a pairplot**

In [1]:
sns.pairplot(data=customers)

Based on the pairplot, Length of Membership appears to be the feature most strongly correlated with Yearly Amount Spent.
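The visual impression from the pairplot can be checked numerically with a correlation table. The sketch below runs on a small synthetic dataframe (`demo`, made up here with a hypothetical spending rule), not on the real customers data. On the real dataframe the same idea would likely need `customers.corr(numeric_only=True)` in recent pandas versions, because of the text columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customers dataframe (NOT the real data).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    'Time on App': rng.normal(12, 1, n),
    'Time on Website': rng.normal(37, 1, n),
    'Length of Membership': rng.normal(3.5, 1, n),
})
# Hypothetical spending rule, chosen only so the example has some structure.
demo['Yearly Amount Spent'] = (
    40 * demo['Time on App']
    + 60 * demo['Length of Membership']
    + rng.normal(0, 10, n)
)

# Pearson correlation of each feature with the target, strongest first.
corr_with_target = (
    demo.corr()['Yearly Amount Spent']
    .drop('Yearly Amount Spent')
    .sort_values(ascending=False)
)
print(corr_with_target)
```

The feature at the top of the sorted series is the one most linearly related to the target in this toy data.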

Create a linear model plot (seaborn lmplot) of Yearly Amount Spent vs. Length of Membership, with Length of Membership on the x-axis.

In [1]:
sns.lmplot(x='Length of Membership',y='Yearly Amount Spent',data=customers)

# **Training and Testing the Data**

Set a variable X equal to the numerical features of the customers and a variable y equal to the 'Yearly Amount Spent' column.

In [1]:
X = customers[['Avg. Session Length', 'Time on App',
       'Time on Website', 'Length of Membership']]
y = customers['Yearly Amount Spent']

Split the data into training and testing sets with train_test_split, using a test size of 30% and random_state=101.

In [1]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
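To see what the split produces, here is a minimal sketch on placeholder arrays. The 500-row size and the names `X_demo` and `y_demo` are assumptions for illustration, not the real dataset. It also checks that reusing the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for X and y (500 rows is an assumption).
X_demo = np.arange(2000).reshape(500, 4)
y_demo = np.arange(500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print('train:', X_tr.shape, 'test:', X_te.shape)

# Splitting again with the same random_state gives the same test rows.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
same_split = np.array_equal(X_te, X_te_again)
print('reproducible:', same_split)
```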

Create a LinearRegression instance named lm and train/fit it on the training data.

In [1]:
from sklearn.linear_model import LinearRegression

In [1]:
lm = LinearRegression()

In [1]:
lm.fit(X_train,y_train)

**Print out the coefficients and the intercept of the model.**

In [1]:
print('Coefficients:',lm.coef_)
print('\n')
print('Intercept:', lm.intercept_)
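The coefficients and intercept fully describe the fitted model: a prediction is the intercept plus the dot product of the features with the coefficients. A minimal sketch on synthetic, noise-free data follows; `X_demo`, `true_coef` and `lm_demo` are hypothetical names used only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with known (hypothetical) coefficients.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
true_coef = np.array([2.0, -1.0, 0.5, 3.0])
y_demo = 10.0 + X_demo @ true_coef

lm_demo = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand from intercept_ and coef_.
manual = lm_demo.intercept_ + X_demo @ lm_demo.coef_
matches = np.allclose(manual, lm_demo.predict(X_demo))
print('manual formula matches predict():', matches)
```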

# **Predicting Test Data**

Use lm.predict() to generate predictions for the X_test set.

In [1]:
predictions = lm.predict(X_test)

Create a scatter plot of the test values vs. the predicted values.

In [1]:
plt.scatter(y_test,predictions)

# **Evaluating the Model**

Compute the Mean Absolute Error, Mean Squared Error and Root Mean Squared Error

In [1]:
from sklearn import metrics

In [1]:
print('MAE:',metrics.mean_absolute_error(y_test,predictions))
print('MSE:',metrics.mean_squared_error(y_test,predictions))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,predictions)))
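For reference, the three metrics can be computed by hand with numpy. The sketch below uses toy values made up for illustration, not the model's actual predictions.

```python
import numpy as np

# Toy values, made up for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # per-sample residuals
mae = np.mean(np.abs(errors))     # Mean Absolute Error
mse = np.mean(errors ** 2)        # Mean Squared Error
rmse = np.sqrt(mse)               # Root Mean Squared Error, in the target's units

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

MSE penalizes large errors more heavily than MAE, and RMSE brings that penalty back to the same units as the target (here, dollars).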

# **Residuals**

Plot a histogram of the residuals to see whether they are approximately normally distributed.

In [1]:
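# distplot is deprecated in recent seaborn releases; histplot with kde=True is the suggested replacement
sns.histplot(y_test-predictions,bins=45,color='grey',kde=True)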
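A histogram is a visual check only. As a rough numeric complement, the skewness and excess kurtosis of the residuals should both be close to 0 if they are approximately normal. The sketch below uses synthetic residuals (`residuals_demo`, an assumption for illustration); on the real model one would use `y_test - predictions` instead.

```python
import numpy as np
import pandas as pd

# Synthetic residuals drawn from a normal distribution (NOT the model's residuals).
rng = np.random.default_rng(42)
residuals_demo = pd.Series(rng.normal(loc=0.0, scale=10.0, size=1000))

skewness = residuals_demo.skew()          # 0 for a symmetric distribution
excess_kurtosis = residuals_demo.kurt()   # 0 for a normal distribution (Fisher definition)

print('skewness:', skewness)
print('excess kurtosis:', excess_kurtosis)
```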

# **Conclusions**

In [1]:
coefficients = pd.DataFrame(lm.coef_,X.columns)
coefficients.columns = ['Coefficient']
coefficients

* Holding all other features constant, a 1 unit increase in Avg. Session Length is associated with an increase of about 25.98 dollars in Yearly Amount Spent.
* Holding all other features constant, a 1 minute increase in Time on App is associated with an increase of about 38.59 dollars in Yearly Amount Spent.
* Holding all other features constant, a 1 minute increase in Time on Website is associated with an increase of about 0.19 dollars in Yearly Amount Spent.
* Holding all other features constant, a 1 year increase in Length of Membership is associated with an increase of about 61.27 dollars in Yearly Amount Spent.
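Because the model is linear, the effects of several changes simply add up. As a worked example using the coefficients from the bullets above, the sketch below estimates the change in Yearly Amount Spent for a hypothetical customer who spends one more minute on the App and stays a member one more year. The `deltas` values are an assumption chosen for illustration.

```python
# Coefficients as reported in the bullets above.
coefs = {
    'Avg. Session Length': 25.98,
    'Time on App': 38.59,
    'Time on Website': 0.19,
    'Length of Membership': 61.27,
}

# Hypothetical change in each feature (illustrative values only).
deltas = {
    'Avg. Session Length': 0.0,
    'Time on App': 1.0,           # one more minute on the App
    'Time on Website': 0.0,
    'Length of Membership': 1.0,  # one more year of membership
}

# Predicted change = sum of coefficient * change over all features.
change_in_spend = sum(coefs[name] * deltas[name] for name in coefs)
print('Predicted change in Yearly Amount Spent:', round(change_in_spend, 2))
```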