# Linear Regression Project
#### Benjamin Jones

I forked this project from "Benjamin Jones's Linear Regression Project" and added the coefficient of correlation (R), the coefficient of determination (R squared), and an implementation of the model using statsmodels.formula.api.

### Imports:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Exploratory Data Analysis (EDA)

Let's get the data and explore it:

The dataset is available on Kaggle.

In [None]:
df = pd.read_csv('../input/ecommerce-customers/Ecommerce Customers.csv')

In [None]:
df.head().transpose()

In [None]:
df.columns

In [None]:
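from scipy.stats import pearsonr

# Columns from index 3 onward (the numeric ones used below)
numeric = df.iloc[:, 3:]
# Print Pearson's R for every ordered pair of distinct columns
for x in numeric.columns:
    for y in numeric.columns:
        if x != y:
            r, _ = pearsonr(df[x], df[y])
            print(f"R({x}, {y}) is: {r}")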

The coefficient of correlation has been calculated for every pair of attributes.

No two features are strongly correlated with each other, so a key assumption of linear regression holds (i.e. no two dimensions of the feature space are strongly related). Hence we can build a model with one target, Yearly Amount Spent, and 4 variables (Avg. Session Length, Time on App, Time on Website, and Length of Membership).


* Leaving out Time on Website would not cause a problem either.
* It changes the mean squared error only slightly.
* The dataset has a low variance factor.
* Hence there is no need for dimensionality reduction.
* Except for Length of Membership, the other 3 attributes have a coefficient of correlation with the target below 0.5 (so there may be a chance of a statistical fluke in the given data).
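The check described above, that no two features are strongly related, can be sketched as a small helper that flags feature pairs whose absolute correlation reaches a cutoff. This is only an illustration: the function name `strongly_correlated_pairs`, the 0.8 threshold, and the synthetic data are assumptions, not part of the original notebook.

```python
import numpy as np

def strongly_correlated_pairs(data, names, threshold=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # each column is one feature
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic stand-in data (NOT the e-commerce dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)               # independent of a
c = a + 0.05 * rng.normal(size=200)    # deliberately almost a copy of a
demo = np.column_stack([a, b, c])

flagged = strongly_correlated_pairs(demo, ["a", "b", "c"])
print(flagged)  # only the (a, c) pair should be flagged
```

On the notebook's data, one could pass `X.values` and `list(X.columns)`; an empty result would be consistent with the claim that the features are not strongly related.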

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df.describe()

Let's use Seaborn to plot some graphs to compare the relationships between the columns:

In [None]:
sns.pairplot(df,diag_kind='kde')

The pairplot shows roughly symmetric, approximately normal distribution curves for all the dimensions.

We can also construct a heatmap of these correlations:

In [None]:
# numeric_only=True skips the text columns (required on pandas 2.x)
sns.heatmap(df.corr(numeric_only=True), cmap='Blues', annot=True)

We can see that there is a strong correlation between Length of Membership and Yearly Amount Spent

## Splitting the Data

Let's split the data into training and testing data. The feature we are interested in predicting is the Yearly Amount Spent.

In [None]:
X = df[['Avg. Session Length','Time on App','Time on Website','Length of Membership']]
y = df['Yearly Amount Spent']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=101)

## Training the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(X_train,y_train)

In [None]:
lm.coef_

## Predicting with the Model

Let's see how well our model performs on the test data (for which we already have the labels)

In [None]:
pred = lm.predict(X_test)

In [None]:
plt.scatter(y_test,pred)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

The predicted values track the actual values closely, so the model looks pretty good!

## Evaluating the Model

Let's calculate some errors:

In [None]:
from sklearn import metrics

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
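For reference, the three error metrics printed above can be written out by hand. The sketch below uses NumPy on a tiny made-up example (the arrays `y_true` and `y_hat` are illustrative values, not results from this dataset) to show what each number measures.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

errors = y_true - y_hat              # residuals: [1, 0, -2]

mae = np.mean(np.abs(errors))        # mean absolute error
mse = np.mean(errors ** 2)           # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error, in the target's units

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

RMSE is often the easiest of the three to read because it is expressed in the same units as the target (here, dollars of Yearly Amount Spent), while MSE penalizes large errors more heavily.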

In [None]:
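print("R-squared of the model (%): " + str(metrics.r2_score(y_test, pred) * 100))

The model's R-squared on the test data is about 98.5%, i.e. the model explains roughly 98.5% of the variance in Yearly Amount Spent.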
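The score reported here is the coefficient of determination, which compares the model's squared errors with those of a baseline that always predicts the mean. A minimal sketch of that calculation, using made-up arrays rather than this dataset:

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

ss_res = np.sum((y_true - y_hat) ** 2)            # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # squared error of predicting the mean
r_squared = 1 - ss_res / ss_tot

print("R-squared:", r_squared)
```

A value near 1 means the model's errors are small relative to the spread of the target; a value near 0 means it does little better than predicting the mean.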

## Model Statistics Using statsmodels

In [None]:
import statsmodels.formula.api as smf
# Join features and target, then rename columns to names usable in a formula
ecommerce = pd.concat([X_train, y_train], axis=1)
ecommerce = ecommerce.rename(columns={
    'Avg. Session Length': 'Avg_Session_Length', 'Time on App': 'Time_on_App',
    'Time on Website': 'Time_on_Website', 'Length of Membership': 'Length_of_Membership',
    'Yearly Amount Spent': 'Yearly_Amount_Spent'})
ecommerce

In [None]:
linearmodel=smf.ols(formula='Yearly_Amount_Spent ~ Avg_Session_Length + Time_on_App + Time_on_Website + Length_of_Membership',data=ecommerce)
linearmodelfit=linearmodel.fit()

The linear model, i.e. OLS (ordinary least squares), has been fitted using statsmodels.formula.api.

In [None]:
linearmodelfit.params

This gives the parameters (coefficients) of the model along with the intercept.

In [None]:
print(linearmodelfit.summary())

* "std err" is the standard error of a coefficient: the estimated standard deviation of the distribution of values that the coefficient estimate could take (e.g. for Avg_Session_Length, fitting the model on different samples would give many possible coefficient values, and "std err" estimates the standard deviation of that distribution).

* "t" plays the same role as a z-score: it is the estimated coefficient divided by its "std err", i.e. how many standard errors the estimate lies from zero, the value it would have if the variable had no effect.

* "P>|t|" (the p-value) is used to judge whether a result may be a statistical fluke: it is the probability of seeing a t value at least this extreme in a population where there is no relationship between the variable and the target.

* The p-value for Time_on_Website is 0.773, so we cannot conclude that this attribute is related to the target; its apparent effect may be a fluke, and it may add little power to another model.
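To make the summary columns more concrete, the sketch below recomputes the standard errors, t values, and two-sided p-values of an ordinary least squares fit by hand with NumPy and SciPy. It runs on synthetic data (not the e-commerce dataset), and the variable names are illustrative assumptions; it is meant to show where the numbers come from, not to replace `summary()`.

```python
import numpy as np
from scipy import stats

# Synthetic data: x1 truly affects y, x2 does not
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(size=n)

# Design matrix with an intercept column first
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # OLS coefficient estimates

resid = y - X @ beta
dof = n - X.shape[1]                              # residual degrees of freedom
sigma2 = resid @ resid / dof                      # estimated noise variance
cov = sigma2 * np.linalg.inv(X.T @ X)             # covariance of the estimates

std_err = np.sqrt(np.diag(cov))                   # "std err" column
t_stat = beta / std_err                           # "t" column
p_value = 2 * stats.t.sf(np.abs(t_stat), dof)     # "P>|t|" column (two-sided)

for name, b, se, t, p in zip(["Intercept", "x1", "x2"], beta, std_err, t_stat, p_value):
    print(f"{name:>9}: coef={b:.3f}  std err={se:.3f}  t={t:.2f}  p={p:.3f}")
```

The feature with a real effect (`x1`) should get a large t value and a p-value close to zero, whereas the p-value for `x2` is governed by chance alone, which is the situation a high p-value such as the one for Time_on_Website suggests.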

## Conclusions

In [None]:
coefficients = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
coefficients

In [None]:
intercept= lm.intercept_
intercept

The regression line is "Y = 25.9 * Avg. Session Length + 38.1 * Time on App + 0.2 * Time on Website + 61.6 * Length of Membership - 1037.9"

These numbers mean that, holding all other features fixed, a 1 unit increase in Avg. Session Length is associated with an increase of about $25.98 in Yearly Amount Spent, and similarly for the other features.
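The "holding all other features fixed" reading can be checked with a short calculation. The sketch below plugs the rounded coefficients from the regression line quoted above into a plain Python function and raises one feature by 1 unit; the customer values in `base` are made up for illustration.

```python
# Rounded coefficients copied from the regression line quoted above
coef = {
    "Avg. Session Length": 25.9,
    "Time on App": 38.1,
    "Time on Website": 0.2,
    "Length of Membership": 61.6,
}
intercept = -1037.9

def predict(customer):
    """Predicted Yearly Amount Spent for a dict of feature values."""
    return intercept + sum(coef[name] * customer[name] for name in coef)

# Hypothetical customer (made-up feature values)
base = {
    "Avg. Session Length": 33.0,
    "Time on App": 12.0,
    "Time on Website": 37.0,
    "Length of Membership": 3.5,
}

# Same customer with one more unit of Time on App, everything else unchanged
bumped = dict(base)
bumped["Time on App"] += 1

diff = predict(bumped) - predict(base)
print("Change in prediction:", round(diff, 2))
```

The change in the prediction equals the Time on App coefficient, whatever the values of the other features are, which is exactly what the interpretation above states.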

Since Time on App is a much more significant factor than Time on Website, the company has a choice: it could either focus all its attention on the App, as that is what brings in the most money, or it could focus on the Website, as it is performing so poorly!

### Thanks for reading!

G Tharun Kumar 

--ref [Benjamin Jones]