# The Sparks Foundation : GRIP
# Data Science and Business Analytics Internship
# Author : Srushti Badukale
# Batch : Oct21

# Task 1 : Prediction Using Supervised Machine Learning
In this task, we will predict the percentage a student is expected to score based on the number of hours they studied.

# Steps to follow:

  1. Importing the dataset
  2. Visualizing the dataset
  3. Data Preparation
  4. Training the Algorithm
  5. Visualizing our Model
  6. Making Predictions
  7. Evaluating the model

# Step 1 : Importing the dataset
We will import the dataset with the help of the pandas library and then inspect the data.

In [None]:
#Importing all required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#%matplotlib inline renders plots directly below the cell and stores them in this notebook
%matplotlib inline

In [None]:
#Reading data from remote link

url = "http://bit.ly/w-data"
s_data = pd.read_csv(url)   #stored as s_data, the name used in every later cell
print("Data imported successfully")

#using head() method to display the first nine rows and confirm the import worked
s_data.head(9)

In [None]:
#shape gives the number of rows and the number of columns of our dataset
s_data.shape

In [None]:
#using tail() method to display the last five rows
s_data.tail(5)

In [None]:
#info() method summarizes the DataFrame: column names, non-null counts, dtypes and memory usage
s_data.info()

In [None]:
#describe() reports summary statistics (count, mean, std, min, quartiles, max) for each numeric column
s_data.describe()

In [None]:
#isnull().sum() counts the null values in each column
s_data.isnull().sum()

As we can see, there are no null values in our dataset.

# Step 2 : Visualizing the dataset
We will plot the dataset to check whether there is any visible relation between the two variables.

In [None]:
# Plotting scores against hours studied

s_data.plot(x='Hours', y='Scores', color='blue', style='o', markersize=5)  
plt.title('Hours vs Percentage')  
plt.xlabel('Hours Studied')  
plt.ylabel('Percentage Score') 
plt.grid()
plt.show()

From the above graph, we can clearly see a positive, roughly linear relation between the number of hours studied and the scores obtained.

In [None]:
#corr() method is used to determine the correlation between variables
s_data.corr()
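
By default, corr() reports the Pearson correlation coefficient. As a rough illustration of what that number measures, the sketch below computes it by hand in plain Python. The helper name `pearson_r` and the sample values are made up for illustration and are not rows from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, chosen only to show a strong positive relation
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [12.0, 25.0, 31.0, 44.0, 50.0]

r = pearson_r(hours, scores)
print("Pearson r:", round(r, 3))
```

A value close to 1 indicates a strong positive linear relation, a value close to -1 a strong negative one, and a value near 0 little linear relation.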

# Step 3 : Data Preparation
We will divide the dataset into attributes (inputs) and labels (outputs), and then split it into two parts: training data and testing data.

In [None]:
#By using the iloc indexer, we will divide the data
#iloc[<row selector>, <column selector>]
X=s_data.iloc[:,:-1].values   #Selecting all the rows and all the columns except the last one
Y=s_data.iloc[:,1].values     #Selecting all the rows and the second column

In [None]:
X

In [None]:
Y

Training data: The part of the data used to fit the model, from which future predictions can be made.

Testing data: The part of the data that the model does not see during training. It is used to check how well the fitted relationship predicts new values.

In [None]:
#Splitting data into train and test data by using train_test_split utility from sklearn.model_selection

from sklearn.model_selection import train_test_split  
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, 
                            test_size=0.2, random_state=0) 
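
To make the idea of the split concrete, here is a minimal plain-Python sketch of a shuffled hold-out split. It is not how train_test_split is implemented and will not reproduce the same rows; the function name `simple_train_test_split` and the ten sample rows are assumptions made for illustration.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows, then hold out a fraction of them for testing."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle repeatable, similar in spirit to random_state
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    test_rows = shuffled[:n_test]
    train_rows = shuffled[n_test:]
    return train_rows, test_rows

# Hypothetical (hours, score) pairs, not taken from the dataset
rows = [(h, 10 * h) for h in range(1, 11)]

train_rows, test_rows = simple_train_test_split(rows, test_size=0.2, seed=0)
print("train size:", len(train_rows))
print("test size:", len(test_rows))
```

With test_size=0.2, one fifth of the rows is held out for testing and the rest is used for training. Fixing the seed means the same rows land in each part on every run.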

# Step 4 : Training the Algorithm
We have split our data into training and testing sets; now it's time to train our algorithm.

In [None]:
from sklearn.linear_model import LinearRegression  
model = LinearRegression()  
model.fit(X_train, Y_train) #fit() method is used to train the model

print("Training Complete.")
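
LinearRegression fits a line by ordinary least squares. With a single input feature, the slope and intercept have a simple closed form, sketched below in plain Python. The helper name `fit_line` and the sample points are assumptions for illustration; the points lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points that lie exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope)
print("intercept:", intercept)
```

In the notebook, the corresponding fitted values are available as model.coef_ and model.intercept_ after fit() has run.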

# Step 5 : Visualizing our Model
We have trained our model; now it's time to visualize it.

In [None]:
#plotting the training data

m=model.coef_
c=model.intercept_
line=m*X_train+c
plt.scatter(X_train, Y_train, color='red')
plt.plot(X_train, line, color='green')
plt.grid()
plt.show()

In [None]:
#plotting the testing data

m=model.coef_
c=model.intercept_
line=m*X_test+c
plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, line, color='green')
plt.grid()
plt.show()

# Step 6 : Making Predictions
We have trained our algorithm, and now it's time to make some predictions.

In [None]:
print(X_test) # Testing data - In Hours
Y_pred = model.predict(X_test) # Predicting the scores

In [None]:
#Building a DataFrame to compare actual and predicted values

df = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})  
print(df) 

Predicting the score of a student who studies for 9.25 hours per day:

In [None]:
#You can also test with your own data
Hours = 9.25
own_pred = model.predict([[Hours]])

print("A student who studies for", Hours, "hours is predicted to score", own_pred[0])
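
For a model with one feature, predict() amounts to evaluating the fitted line at the given input. The sketch below shows that arithmetic with placeholder numbers: `example_slope` and `example_intercept` are hypothetical values, not the coefficients this notebook actually learns.

```python
# Hypothetical coefficients, standing in for model.coef_[0] and model.intercept_
example_slope = 10.0
example_intercept = 2.0

hours = 9.25
# Prediction for a single-feature linear model: intercept + slope * input
predicted_score = example_intercept + example_slope * hours
print("Predicted score:", predicted_score)
```

Replacing the placeholders with the fitted values from the trained model should give the same result as model.predict([[9.25]])[0].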

# Step 7 : Evaluating the Model
This is the final step, in which we evaluate the performance of the algorithm on the test data.

In [None]:
from sklearn import metrics  
print('Mean Absolute Error:', 
       metrics.mean_absolute_error(Y_test, Y_pred))

In [None]:
from sklearn import metrics
print('Mean Squared Error:',
      metrics.mean_squared_error(Y_test, Y_pred))

In [None]:
from sklearn import metrics
print('R-squared Score:',
      metrics.r2_score(Y_test, Y_pred))
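
As a rough guide to what the three metrics measure, the sketch below computes them by hand in plain Python. The function names and the three sample values are made up for illustration and are not results from this notebook.

```python
def mean_absolute_error(y_true, y_pred):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared errors; large errors are penalized more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted scores
y_true = [20.0, 30.0, 60.0]
y_pred = [22.0, 28.0, 63.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print("MAE:", mae)
print("MSE:", mse)
print("R-squared:", r2)
```

Lower is better for the two error metrics. R-squared is a score rather than an error: 1 means a perfect fit, and 0 means the model does no better than always predicting the mean of the actual values.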