# Regression Exercise
The objective of this exercise is to build and evaluate regression models to predict total charge given information from customers of a telephone company (`data.csv`).

### Part 1: Data Preparation

In [1]:
# Load libraries
import pandas as pd
import numpy
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model

In [2]:
# Load dataset and display the first five rows
data = pd.read_csv('data.csv')
data.head()

| | Account length | International plan | Voice mail plan | Number voice mail messages | Total day minutes | Total day calls | Total eve minutes | Total eve calls | Total night minutes | Total night calls | Total intl minutes | Total intl calls | Customer service calls | Total charge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 0 | 1 | 25 | 265.1 | 110 | 197.4 | 99 | 244.7 | 91 | 10.0 | 3 | 1 | 75.56 |
| 1 | 107 | 0 | 1 | 26 | 161.6 | 123 | 195.5 | 103 | 254.4 | 103 | 13.7 | 3 | 1 | 59.24 |
| 2 | 137 | 0 | 0 | 0 | 243.4 | 114 | 121.2 | 110 | 162.6 | 104 | 12.2 | 5 | 0 | 62.29 |
| 3 | 84 | 1 | 0 | 0 | 299.4 | 71 | 61.9 | 88 | 196.9 | 89 | 6.6 | 7 | 2 | 66.8 |
| 4 | 75 | 1 | 0 | 0 | 166.7 | 113 | 148.3 | 122 | 186.9 | 121 | 10.1 | 3 | 3 | 52.09 |


**Task 01 (of 15): Partition the dataset into training set and test set using the `train_test_split` method.
Use 75% of the data for training and 25% for testing and set parameter `random_state` to 0.**

In [3]:
# Features are every column except the last; the target is 'Total charge'
x_train, x_test, y_train, y_test = train_test_split(
    data.iloc[:, :-1], data['Total charge'], test_size=0.25, random_state=0
)

**Task 02 (of 15): Standardize the training set and test set.**
_Hint:_ Compute the mean and standard deviation using only the training set to avoid introducing bias and then apply this transformation on the training set and test set.

In [4]:
scaler = StandardScaler()
scaler.fit(x_train)                        # learn mean and std from the training set only
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)   # reuse the training statistics on the test set
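
The hint above can be illustrated with a small sketch. The arrays `toy_train` and `toy_test` are made-up values standing in for `x_train` and `x_test`; the point is only that the scaler's statistics come from the training rows, so the scaled test rows are generally not centred at zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for x_train / x_test (values are made up)
toy_train = np.array([[100.0, 1.0], [200.0, 3.0], [300.0, 5.0]])
toy_test = np.array([[400.0, 7.0]])

# Statistics are estimated from the training rows only
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.mean_)                # per-column training mean
print(toy_scaler.scale_)               # per-column training standard deviation
print(toy_train_scaled.mean(axis=0))   # approximately 0 for each column
print(toy_test_scaled)                 # test rows expressed in training units
```

Fitting the scaler on the full dataset instead would let information from the test rows leak into the preprocessing step.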

### Part 2: Simple Linear Regression

**Task 03 (of 15): Build a simple linear model to predict 'Total charge' with 'Total day minutes' as the predictor and print the coefficient of the model.**
_Hint:_ `X` must be a 2D array.

In [5]:
# Column 4 of the feature matrix is 'Total day minutes'; reshape it to (n_samples, 1)
model = linear_model.LinearRegression()
fitted_model = model.fit(x_train_scaled[:, 4].reshape(-1, 1), y_train)
print(fitted_model.coef_)

[9.30234225]
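
As a side note on the 2D hint, here is a minimal sketch (with a made-up array `toy`) of two equivalent ways to select a single column as a 2D array, which is the shape scikit-learn estimators expect for `X`.

```python
import numpy as np

toy = np.arange(12.0).reshape(4, 3)   # 4 samples, 3 features (made-up values)

col_1d = toy[:, 1]                    # shape (4,): a flat vector
col_2d_a = toy[:, 1].reshape(-1, 1)   # shape (4, 1): a single column
col_2d_b = toy[:, [1]]                # shape (4, 1): same result via list indexing

print(col_1d.shape, col_2d_a.shape, col_2d_b.shape)
```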


**Task 04 (of 15): Use the model to predict 'Total charge' for the test set.**
_Hint:_ `X` must be a 2D array.

In [6]:
predicted = fitted_model.predict(x_test_scaled[:,4].reshape(-1,1))

**Task 05 (of 15): Compute the coefficient of determination (R squared) of the model over the test set.**
_Hint:_ First compute the correlation coefficient between the predicted y-values and the observed y-values.

In [7]:
corr_coef = numpy.corrcoef(predicted,y_test)[1, 0]
R_squared = corr_coef**2
print(R_squared)

0.7858547147225989


**Question 01 (of 05): What can you conclude about the performance of the model?**

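**Answer:** The squared correlation between predicted and observed values on the test set is about 0.786, so 'Total day minutes' alone accounts for roughly 79% of the variance in 'Total charge'. This is a measure of explained variance, not an accuracy: it does not mean that 79% of the predictions are correct. A single predictor captures most of the signal but leaves about 21% of the variance unexplained.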
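
The hint in Task 05 uses the squared correlation between predictions and observations. For a least-squares fit with an intercept this equals R squared on the training data, but on a test set the two can differ, because correlation ignores any constant offset or rescaling of the predictions. The sketch below uses made-up numbers and scikit-learn's `r2_score` to show the difference; it is an illustration, not part of the exercise solution.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed values and two sets of predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_good = np.array([11.0, 19.0, 31.0, 39.0])   # close to the observations
y_shifted = y_good + 15.0                     # same pattern, constant offset

for name, y_pred in [("good", y_good), ("shifted", y_shifted)]:
    squared_corr = np.corrcoef(y_pred, y_true)[0, 1] ** 2
    r2 = r2_score(y_true, y_pred)
    # squared_corr is identical for both; r2 penalises the offset
    print(name, squared_corr, r2)
```

When predictions are well calibrated, as they typically are for a linear model evaluated on data similar to its training set, the two numbers are close.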

### Part 3: Multiple Linear Regression

**Task 06 (of 15): Build a multiple linear model to predict 'Total charge' with 'Total day minutes', 'Total eve minutes', 'Total night minutes', and 'Total intl minutes' as predictors and print the coefficients of the model.**

In [8]:
# Columns 4, 6, 8, 10 are the day, eve, night, and intl minutes
model = linear_model.LinearRegression()
fitted_model = model.fit(x_train_scaled[:, [4, 6, 8, 10]], y_train)
print(fitted_model.coef_)

[9.23422081 4.34962707 2.2813772  0.75805126]


**Task 07 (of 15): Use the model to predict 'Total charge' for the test set.**

In [9]:
predicted = fitted_model.predict(x_test_scaled[:,[4,6,8,10]])

**Task 08 (of 15): Compute the coefficient of determination (R squared) of the model over the test set.**
_Hint:_ First compute the correlation coefficient between the predicted y-values and the observed y-values.

In [10]:
corr_coef = numpy.corrcoef(predicted,y_test)[1, 0]
R_squared = corr_coef**2
print(R_squared)

0.9999997074154813


**Question 02 (of 05): What can you conclude about the performance of the model?**

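**Answer:** The model explains about 99.99997% of the variance in 'Total charge' on the test set, so its predictions are almost exact. A fit this close suggests that the total charge is essentially a linear function of the four minutes variables.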

**Task 09 (of 15): Build a multiple linear model to predict 'Total charge' with all features as predictors and print the coefficients of the model.**

In [11]:
model = linear_model.LinearRegression()
fitted_model = model.fit(x_train_scaled,y_train)
print(fitted_model.coef_)

[ 1.86701207e-04 -5.41336738e-05  4.87160937e-04 -4.61769986e-04
  9.23422162e+00  2.04345920e-04  4.34963578e+00  8.54680666e-05
  2.28136883e+00  3.74275670e-05  7.58044203e-01  1.35159379e-04
 -2.43936617e-05]


**Task 10 (of 15): Use the model to predict 'Total charge' for the test set.**

In [12]:
predicted = fitted_model.predict(x_test_scaled)

**Task 11 (of 15): Compute the coefficient of determination (R squared) of the model over the test set.**
_Hint:_ First compute the correlation coefficient between the predicted y-values and the observed y-values.

In [13]:
corr_coef = numpy.corrcoef(predicted,y_test)[1, 0]
R_squared = corr_coef**2
print(R_squared)

0.9999997032282442


**Question 03 (of 05): What can you conclude about the performance of the model?**

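**Answer:** This model also explains about 99.99997% of the variance, practically the same as the four-predictor model. The coefficients of the nine added features are all below 0.001 in magnitude, so they contribute almost nothing beyond the four minutes variables.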

### Part 4: Regularization

**Task 12 (of 15): Build a LASSO regression model to predict 'Total charge' with all features as predictors.**

In [14]:
model = linear_model.Lasso(alpha = 1)
fitted_model = model.fit(x_train_scaled,y_train)

**Task 13 (of 15): Print the coefficients of the model.**

In [15]:
print(fitted_model.coef_)

[-0.          0.          0.          0.          8.24885419  0.
  3.3331111  -0.          1.25156405  0.          0.          0.
 -0.        ]


**Task 14 (of 15): Use the model to predict 'Total charge' for the test set.**

In [16]:
predicted = fitted_model.predict(x_test_scaled)

**Task 15 (of 15): Compute the coefficient of determination (R squared) of the model over the test set.**
_Hint:_ First compute the correlation coefficient between the predicted y-values and the observed y-values.

In [17]:
corr_coef = numpy.corrcoef(predicted,y_test)[1, 0]
R_squared = corr_coef**2
print(R_squared)

0.985441651550435


**Question 04 (of 05): What can you conclude about the coefficients and the performance of the model?**

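**Answer:** Only three coefficients are non-zero: 'Total day minutes' (8.25), 'Total eve minutes' (3.33), and 'Total night minutes' (1.25). A zero coefficient does not mean that a variable has zero correlation with the output, only that its contribution is too small to survive the penalty.
LASSO performs feature selection by shrinking coefficients toward zero and setting the weakest ones exactly to zero. With `alpha = 1` this also removes 'Total intl minutes', whose coefficient in the unregularized model was only about 0.76.
The R squared value is about 0.985, so the model explains roughly 98.5% of the variance in the test set. That is still a very good fit, but slightly worse than the unregularized models, because the penalty also shrinks the three surviving coefficients by about 1 each.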
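
The LASSO coefficients can be roughly anticipated from the unregularized ones. scikit-learn's `Lasso` minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`; if the standardized features were exactly uncorrelated, the solution would be the least-squares coefficients shrunk toward zero by `alpha` and clipped at zero (soft-thresholding). The sketch below applies that rule to the four-predictor coefficients printed in Task 06. It is only an approximation here, since the real features are not perfectly uncorrelated, and the helper `soft_threshold` is a hypothetical name introduced for illustration.

```python
import numpy as np

def soft_threshold(w, alpha):
    """Shrink each coefficient toward zero by alpha and clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

# Coefficients of the four-predictor linear model from Task 06
# (day, eve, night, intl minutes)
ols_coef = np.array([9.23422081, 4.34962707, 2.2813772, 0.75805126])

approx_lasso = soft_threshold(ols_coef, alpha=1.0)
print(approx_lasso)
```

The result is close to the non-zero LASSO coefficients printed in Task 13, and the 'Total intl minutes' coefficient drops to zero because its unregularized value is smaller than `alpha`.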

**Question 05 (of 05): Based on all the results obtained, what are the most important variables to predict the total charge of a user? Justify your answer.**

**Answer:** The most important features are 'Total day minutes', 'Total eve minutes', 'Total night minutes', and 'Total intl minutes'. Together they explain about 99.99997% of the variance in the test set (the R squared of the four-predictor model), and adding the remaining features does not improve the fit: their coefficients in the full linear model are all close to zero. Among the four, 'Total day minutes' has the largest standardized coefficient and 'Total intl minutes' the smallest, which is why LASSO with `alpha = 1` drops the latter. This is also what one would expect: the total charge is driven by total usage, and usage is measured by the minutes used in each period.
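
To make the last point concrete, the sketch below builds synthetic data in which the charge is exactly a sum of minutes times per-minute rates, then recovers those rates from a model fitted on standardized features. The rates, the sample size, and the names `demo_scaler` and `demo_model` are assumptions chosen for illustration; none of them are taken from `data.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-minute rates (day, eve, night, intl); not taken from data.csv
rates = np.array([0.20, 0.10, 0.05, 0.30])

# Synthetic minutes for 500 customers and the charge they imply
minutes = rng.uniform(0.0, 300.0, size=(500, 4))
charge = minutes @ rates

demo_scaler = StandardScaler().fit(minutes)
demo_model = LinearRegression().fit(demo_scaler.transform(minutes), charge)

# A coefficient on a standardized feature is "charge per standard deviation";
# dividing by the feature's standard deviation gives "charge per minute".
recovered_rates = demo_model.coef_ / demo_scaler.scale_
print(recovered_rates)
```

Applying the same division to the real `fitted_model.coef_` and `scaler.scale_` would give an estimate of the per-minute rates implied by the dataset, provided the columns are matched correctly.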