## THEORY QUESTIONS

**Q1) What is overfitting and how to avoid it?**  
**Answer -** Overfitting occurs when a model fits the details and noise of the training data so closely that its performance on new data suffers: the model fails to predict future observations reliably.  
Ways to avoid overfitting:

1.   Remove redundant or irrelevant features.
2.   Ensure the number of data points is greater than the number of independent variables.
3.   Use cross-validation: split the data into several folds, train on all but one fold, evaluate on the held-out fold, and rotate so that every fold is used for evaluation once. This gives a less biased estimate of performance on unseen data.
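
As a rough illustration of cross-validation, the sketch below runs 5-fold cross-validation on a linear model. The data is synthetic and the names `X_demo`, `y_demo`, and the choice of 5 folds are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption): a noisy linear relationship y = 3x + 2 + noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# 5-fold cross-validation: each fold is used as the held-out set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)

# scikit-learn reports negated MSE so that "higher is better"; undo that here
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```

A large gap between the training error and the cross-validated error is a common sign of overfitting.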


**Q2) What is RMSE and MSE? How can you calculate them?**  
**Answer -** RMSE stands for Root Mean Square Error. It measures the typical size of the residuals, that is, the differences between the actual values $y_i$ and the predicted values $\hat{y}_i$.  
To calculate RMSE:

1.   Compute the residual for each observation  
$e_i = y_i - \hat{y}_i$
2.   Sum the squared residuals  
$\sum_{i=1}^n (y_i - \hat{y}_i)^2$
3.   Take the mean of that sum  
$\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}$
4.   Take the square root of the value obtained  
$RMSE = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}}$

MSE stands for Mean Square Error. It is the average of the squared differences between the actual and predicted values, so it equals the square of the RMSE.  
To calculate MSE:

1.   Compute the residual for each observation  
$e_i = y_i - \hat{y}_i$
2.   Sum the squared residuals  
$\sum_{i=1}^n (y_i - \hat{y}_i)^2$
3.   Take the mean of that sum  
$MSE = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}$
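
A minimal numeric sketch of MSE and RMSE, computed between actual and predicted values with NumPy. The arrays `y_true` and `y_pred` are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred      # residuals: [1, 0, -1, -2]
mse = np.mean(errors ** 2)    # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)           # square root of the MSE

print("MSE:", mse)
print("RMSE:", rmse)
```

Because RMSE is in the same units as the target variable, it is usually the easier of the two to interpret.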
 

**Q3) What is Line of best fit?**  
**Answer -** The line of best fit is the line through a scatter plot of data points that best expresses the relationship between those points. Statistical methods such as least squares are used to find the equation of this line by minimizing the sum of squared vertical distances between the points and the line.
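
The least squares line can be sketched directly from its closed-form solution. The points `x_pts` and `y_pts` below are made up for illustration; the result is cross-checked against `np.polyfit`.

```python
import numpy as np

# Hypothetical data points, for illustration only
x_pts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pts = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form least squares for a single feature:
# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
# intercept = y_mean - slope * x_mean
x_mean, y_mean = x_pts.mean(), y_pts.mean()
slope = np.sum((x_pts - x_mean) * (y_pts - y_mean)) / np.sum((x_pts - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check against NumPy's degree-1 polynomial fit (returns slope first)
slope_np, intercept_np = np.polyfit(x_pts, y_pts, deg=1)

print("slope:", slope, "intercept:", intercept)
```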

**Q4) Explain multivariate linear regression using a real-life example.**  
**Answer -** Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.  
For example, we can estimate the daily price of grain in an agricultural market. The daily price (Y) depends on the previous day's temperature (T), humidity (H), stock sold (S), market arrivals (A), price of a substitute commodity (C), and so on. We can form the following multiple regression equation.  
$Y = w_0 + w_1 T + w_2 H + w_3 S + w_4 A + w_5 C + \text{error}$  
Thus Y (daily price) is the response variable, and the other factors are the independent (explanatory) variables that determine its value.
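
To make the equation concrete, here is a small sketch that fits a multiple regression on synthetic data shaped like the grain price example. All values, value ranges, and the "true" weights are invented assumptions; only the variable names T, H, S, A, C come from the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_days = 200

# Synthetic explanatory variables (all ranges are invented for illustration)
T = rng.uniform(15, 40, n_days)     # temperature
H = rng.uniform(20, 90, n_days)     # humidity
S = rng.uniform(100, 500, n_days)   # stock sold
A = rng.uniform(100, 600, n_days)   # market arrivals
C = rng.uniform(10, 30, n_days)     # substitute commodity price

# Assumed "true" weights used to generate the synthetic price
true_w0 = 50.0
true_w = np.array([0.5, -0.2, 0.03, -0.04, 1.5])
features = np.column_stack([T, H, S, A, C])
price = true_w0 + features @ true_w + rng.normal(0, 0.5, n_days)

# Fit Y = w0 + w1*T + w2*H + w3*S + w4*A + w5*C
model = LinearRegression().fit(features, price)
print("Estimated intercept w0:", model.intercept_)
print("Estimated weights w1..w5:", model.coef_)
```

With enough data and little noise, the estimated weights should land close to the ones used to generate the data.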

**Q5) How can we improve the accuracy of a linear regression model?**  
**Answer -** To improve the accuracy of a linear regression model we can:

1.   Add more data points and "let the data speak for itself". More data reduces the influence of any single observation and makes the fitted model more reliable.
2.   Treat missing values and outliers appropriately.
3.   Track RMSE and MSE between the predicted and actual values: the lower they are, the more accurate the model. These metrics do not improve the model by themselves, but they show whether a change actually helped.
4.   Tune the algorithm's parameters, since some parameter settings give higher accuracy than others.
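
One possible way to treat missing values and outliers is sketched below: median imputation followed by the 1.5 × IQR rule. The array `values` is hypothetical, and both techniques are common choices assumed for this example rather than the only options.

```python
import numpy as np

# Hypothetical measurements with two missing entries and one obvious outlier
values = np.array([12.0, 15.0, np.nan, 14.0, 13.0, 120.0, 16.0, np.nan, 17.0])

# 1) Fill missing entries with the median of the observed values
median = np.nanmedian(values)
filled = np.where(np.isnan(values), median, values)

# 2) Flag outliers with the 1.5 * IQR rule and drop them
q1, q3 = np.percentile(filled, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (filled < lower) | (filled > upper)
cleaned = filled[~is_outlier]

print("Filled: ", filled)
print("Cleaned:", cleaned)
```

The median is used for imputation because, unlike the mean, it is not pulled toward the outlier.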



# SIMPLE LINEAR REGRESSION WITH PYTHON

In [None]:
# importing libraries to work with
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

In [None]:
# creating data frame from CSV file using pandas library
data = pd.read_csv("../input/salary-data/salary_data.csv")
data.head(10)

The DataFrame above has 30 rows, with one independent variable and one dependent variable.

Preprocessing the input data is good practice. Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data. To obtain a mean of 0 and a standard deviation of 1 we use `sklearn.preprocessing.StandardScaler().fit(X).transform(X.astype(float))`, which transforms the feature values.
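
The transformation itself is just $z = \frac{x - \mu}{\sigma}$. The sketch below applies it by hand to a hypothetical `years` array to show that the result has mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical feature column, for illustration only
years = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

mean = years.mean(axis=0)
std = years.std(axis=0)          # population std (ddof=0), as StandardScaler uses
scaled = (years - mean) / std

print("Mean after scaling:", scaled.mean())
print("Std after scaling: ", scaled.std())
```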

In [None]:
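# Feature matrix and target values as 2-D NumPy arrays
X = data[["YearsExperience"]].values
y = data[["Salary"]].values
# Standardize the feature to mean 0 and standard deviation 1
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))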

We split the data into test and train datasets using `sklearn.model_selection.train_test_split()`.  
We wrote a program to check which *test_size* gives the most accurate model at *random_state=0*.

In [None]:
# Candidate test sizes 0.1, 0.2, ..., 0.9 (taken from the list to avoid float accumulation errors)
x_rmse = [i/10 for i in range(1,10)]
rmse = []
for size in x_rmse:
    # Fit on the training split and record the RMSE on the test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=size, random_state=0)
    ln_reg = LinearRegression().fit(X_train, y_train)
    rmse.append(np.sqrt(np.mean((ln_reg.predict(X_test) - y_test) ** 2)))

# Refit the model with the test size that gave the lowest RMSE
k = x_rmse[int(np.argmin(rmse))]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=k, random_state=0)
ln_reg = LinearRegression().fit(X_train, y_train)

plt.plot(x_rmse, rmse, color='red')
plt.title('Test_size VS RMSE')
plt.xlabel('Test_size')
plt.ylabel('RMSE')
plt.annotate('Minimum', xy =(k, min(rmse)), xytext=(k-0.06,min(rmse)+2000),arrowprops = dict(facecolor ='cyan',shrink = 0.05))
plt.show()



From the graph above we observe that the model gives the lowest RMSE at *test_size=0.2*, so we split the train and test data accordingly.
We fit the simple linear regression model using `sklearn.linear_model.LinearRegression().fit()`.

The $R^2$ value is very high, close to 1, which indicates that the model fits the data well: 98.81% of the variance in salary is explained by the model.  
$R^2 = 1 - \frac{u}{v}$  
$u$ : Residual sum of squares  
$v$ : Total sum of squares
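
As a small numeric check of this formula, the sketch below computes $R^2$ by hand. The arrays `y_actual` and `y_hat` are hypothetical values, not taken from the salary dataset.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

u = np.sum((y_actual - y_hat) ** 2)             # residual sum of squares = 0.5
v = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares = 20.0
r2 = 1 - u / v                                  # 1 - 0.5 / 20

print("R^2:", r2)
```

This is the same quantity that `LinearRegression.score()` returns for a regression model.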

In [None]:
ln_reg.score(X_test, y_test)

In [None]:
# Visualizing the Training set results
plt.scatter(X_train, y_train, color='green')
plt.plot(X_train, ln_reg.predict(X_train), color='blue')
plt.title('Salary VS Experience (Training set)')
plt.xlabel('Years of Experience (standardized)')
plt.ylabel('Salary')
plt.show()

# Visualizing the Test set results
plt.scatter(X_test, y_test, color='green')
plt.plot(X_test, ln_reg.predict(X_test), color='blue')
plt.title('Salary VS Experience (Test set)')
plt.xlabel('Years of Experience (standardized)')
plt.ylabel('Salary')
plt.show()