# AUTHOR : Shree Nidhi


# LINEAR REGRESSION WITH PYTHON SCIKIT LEARN

In this section we will see how the Python Scikit-Learn library for machine learning can be used to implement regression functions. We will start with simple linear regression involving two variables.

# SIMPLE LINEAR REGRESSION

In this regression task we will predict the percentage of marks that a student is expected to score based upon the number of hours they studied. This is a simple linear regression task as it involves just two variables.

In [None]:
# Importing all libraries required

import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

In [None]:
# Reading data from given link

url = "http://bit.ly/w-data"
s_data = pd.read_csv(url)
print("Data imported successfully...")
s_data.head(10)

Let's plot our data points on a 2-D graph to eyeball the dataset and see whether we can spot any relationship between the two variables. We can create the plot with the following script:

In [None]:
# Plotting the graph of the distribution of scores
s_data.plot(x='Hours', y='Scores', style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours Studied')
plt.ylabel('Percentage Score')
plt.show()

The graph shows a clear positive linear relationship between the number of hours studied and the percentage score.



# PREPARING THE DATA

The next step is to divide the data into inputs ("attributes") and outputs ("labels").

In [None]:
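x = s_data.iloc[:, :-1].values  # inputs: every column except the last (Hours), kept 2-D
y = s_data.iloc[:, 1].values    # outputs: the second column (Scores), 1-D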
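
The slicing above produces a 2-D input array and a 1-D output array, which is the layout Scikit-Learn estimators expect (samples in rows, features in columns). The following is a minimal sketch on a tiny made-up DataFrame; `demo`, `x_demo`, `y_demo` and the values in them are assumptions for illustration, not the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for s_data; the numbers are made up for illustration
demo = pd.DataFrame({"Hours": [1.0, 2.0, 3.0, 4.0], "Scores": [10, 20, 30, 40]})

x_demo = demo.iloc[:, :-1].values  # all columns except the last -> 2-D array
y_demo = demo.iloc[:, 1].values    # second column only -> 1-D array

print(x_demo.shape)  # (4, 1)
print(y_demo.shape)  # (4,)
```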


Now that we have our inputs and outputs, the next step is to split this data into training and test sets. We'll do this by using Scikit-Learn's built-in train_test_split() function:


In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
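
To make the effect of `test_size=0.2` and `random_state=0` concrete, here is a small sketch on synthetic data. The arrays `x_demo` and `y_demo` are hypothetical (ten made-up samples), so the sizes printed apply to this toy example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, assumed for illustration: 10 samples, 1 feature
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * x_demo.ravel() + 1.0

# 20% of the samples are held out for testing
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(len(x_tr), len(x_te))  # 8 2

# Fixing random_state makes the shuffle reproducible across runs
x_tr2, x_te2, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(x_te, x_te2))  # True
```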


# TRAINING THE ALGORITHM

We have split our data into training and testing sets, and now is finally the time to train our algorithm.

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
print("Training Complete.")
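
Training a linear regression amounts to estimating a slope and an intercept, which the fitted estimator exposes as `coef_` and `intercept_`. As a rough sketch, assuming synthetic data that lies exactly on the line y = 3x + 2 (the names `model`, `x_demo` and `y_demo` are hypothetical), the fit recovers those two numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data on the line y = 3x + 2 (assumed for illustration)
x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = 3.0 * x_demo.ravel() + 2.0

model = LinearRegression()
model.fit(x_demo, y_demo)

print("Slope:", model.coef_[0])        # approximately 3.0
print("Intercept:", model.intercept_)  # approximately 2.0
```

On real, noisy data the recovered values are estimates rather than exact parameters.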

In [None]:
# Computing the regression line: y = slope * x + intercept
line = regressor.coef_ * x + regressor.intercept_

# Plotting all data points together with the fitted line
plt.scatter(x, y)
plt.plot(x, line)
plt.show()
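
The regression line computed by hand from `coef_` and `intercept_` is the same quantity that `predict()` returns. A minimal check, assuming a few made-up noisy points (`x_demo`, `y_demo` and `model` are hypothetical names):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noisy points, for illustration only
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.1, 3.9, 6.2, 7.8])

model = LinearRegression().fit(x_demo, y_demo)

# Manual line: slope * x + intercept, flattened to 1-D for comparison
manual_line = (model.coef_ * x_demo + model.intercept_).ravel()
predicted = model.predict(x_demo)

print(np.allclose(manual_line, predicted))  # True
```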

# MAKING PREDICTIONS

It's time to make some predictions.

In [None]:
print(x_test) # Testing data - In hours
y_pred = regressor.predict(x_test) # Predicting the scores

In [None]:
# Comparing Actual with Predicted
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

In [None]:
# You can also test with your own data
hours = 9.25
own_pred = regressor.predict([[hours]])  # predict() expects a 2-D input
print("No of hours = {}".format(hours))
print("Predicted Score = {}".format(own_pred[0]))  # predict() returns an array
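
Because `predict()` takes a 2-D array with one row per sample, several inputs can be scored in a single call. The sketch below assumes a model fitted on synthetic data where the score is exactly ten times the hours; `model` and `new_hours` are hypothetical names and the values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data: score = 10 * hours (assumed for illustration)
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 10.0 * x_demo.ravel()
model = LinearRegression().fit(x_demo, y_demo)

# One row per sample, one column per feature
new_hours = np.array([2.5, 5.0, 9.25]).reshape(-1, 1)
scores = model.predict(new_hours)

for h, s in zip(new_hours.ravel(), scores):
    print("No of hours = {}, Predicted Score = {:.2f}".format(h, s))
```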

# EVALUATING THE MODEL

The last step is to evaluate the performance of the algorithm. This step is particularly important for comparing how well different algorithms perform on a particular dataset. For simplicity, we have chosen the mean absolute error. There are many other such metrics.

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
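
Other common regression metrics include the mean squared error, its square root (RMSE), and the R² score. The following sketch computes them on made-up values; `y_true_demo` and `y_pred_demo` are hypothetical arrays, not results from the model above.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values, for illustration only
y_true_demo = np.array([20.0, 30.0, 40.0, 50.0])
y_pred_demo = np.array([22.0, 28.0, 43.0, 49.0])

mae = metrics.mean_absolute_error(y_true_demo, y_pred_demo)
mse = metrics.mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)  # same units as the target variable
r2 = metrics.r2_score(y_true_demo, y_pred_demo)

print('Mean Absolute Error:', mae)      # 2.0
print('Mean Squared Error:', mse)       # 4.5
print('Root Mean Squared Error:', rmse)
print('R2 Score:', r2)
```

Lower is better for the three error metrics, while an R² closer to 1 indicates a better fit.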