# **Problem: Create a linear regression model (using scikit-learn) on the 'average' dataset**

This Python program creates a linear regression model (using scikit-learn) on the 'average' dataset. At the end, it evaluates the model with the R² score, which the notebook reports as its accuracy.

Run all the cells in order.

**Notes:**

Check the following before running the program:

1. The scikit-learn package (imported as sklearn) must be installed on the local machine.
2. The pandas package must be installed on the local machine to read the CSV file.
3. The gdown package must be installed on the local machine to download the CSV file from Google Drive.
4. The file location given for the CSV file is correct.
5. You have access to the file.
6. The file is in the correct format (CSV).
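
The checks in the notes above can be partly automated. The following is a minimal sketch, not part of the original notebook: the helper names `missing_modules` and `csv_is_readable` are hypothetical, and the file check only confirms that the path exists and ends in `.csv`, not that its contents are valid.

```python
import importlib.util
from pathlib import Path


def missing_modules(names):
    """Return the module names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def csv_is_readable(path):
    """Return True if `path` is an existing file with a .csv extension."""
    p = Path(path)
    return p.is_file() and p.suffix.lower() == ".csv"


# Modules the notebook imports (scikit-learn is imported as `sklearn`)
required = ["sklearn", "pandas", "gdown"]
missing = missing_modules(required)
if missing:
    print("Install these modules first:", ", ".join(missing))
```

Once the file has been downloaded, `csv_is_readable('average.csv')` can serve as a quick check before calling `pd.read_csv`.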

# Import Modules

In [1]:
# Import pandas to read csv
import pandas as pd

# Import train_test_split to split the data into train and test sets
from sklearn.model_selection import train_test_split

# Import LinearRegression class to get linear regression object
from sklearn.linear_model import LinearRegression

# Import r2_score function to calculate the R² score (reported as accuracy)
from sklearn.metrics import r2_score

# Import gdown module to download files from Google Drive
import gdown

# Get the file from Google Drive

In [3]:
# Please use the same dataset
url = 'https://drive.google.com/file/d/1Hxksp6KSjoex0wdER032QypLUYmdLNeg/view?usp=sharing'

# Derive the file id from the url
file_id = url.split('/')[-2]

# Derive the download url of the file
download_url = 'https://drive.google.com/uc?id=' + file_id

# Give the location where you want to save it on your local machine
file_location = 'average.csv'

# Download the file from Google Drive to your local machine
gdown.download(download_url, file_location, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1Hxksp6KSjoex0wdER032QypLUYmdLNeg
To: /content/average.csv
100%|██████████| 301k/301k [00:00<00:00, 66.3MB/s]


'average.csv'
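
The cell above takes the file id as the second-to-last path segment of the sharing URL, which works only while the URL ends in a segment such as `/view?usp=sharing`. As a hedged alternative, the sketch below extracts the id with a regular expression. It assumes the sharing URL has the form `.../file/d/<id>/...`; the function names `drive_file_id` and `drive_download_url` are hypothetical and not part of gdown.

```python
import re


def drive_file_id(share_url):
    """Extract the file id that follows '/d/' in a Google Drive sharing URL."""
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError("No file id found in URL: " + share_url)
    return match.group(1)


def drive_download_url(share_url):
    """Build the download URL in the same 'uc?id=' form the notebook uses."""
    return "https://drive.google.com/uc?id=" + drive_file_id(share_url)


url = 'https://drive.google.com/file/d/1Hxksp6KSjoex0wdER032QypLUYmdLNeg/view?usp=sharing'
print(drive_download_url(url))
```

Unlike `url.split('/')[-2]`, this still finds the id when the URL has no trailing `/view` segment.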

# Create the linear regression model

In [4]:
# Read the CSV
average_dataset = pd.read_csv(file_location)

# Get independent variable columns
X = average_dataset[['A', 'B', 'C', 'D']]

# Get dependent variable column
y = average_dataset['AVERAGE']

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Use LinearRegression class provided by sklearn
regressor = LinearRegression()

# Train the model
regressor.fit(X_train, y_train)

# Predict using test values
y_pred = regressor.predict(X_test)

# Get actual values and predicted values into a table
predicted_results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(predicted_results)

      Actual  Predicted
9394  541.00     541.00
898   554.50     554.50
2398  509.75     509.75
5906  607.00     607.00
2343  595.50     595.50
...      ...        ...
1037  490.00     490.00
2899  761.75     761.75
9549  313.25     313.25
2740  639.25     639.25
6690  407.50     407.50

[2000 rows x 2 columns]
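
The predicted values above match the actual values exactly. One likely explanation, which the notebook does not confirm, is that AVERAGE is the arithmetic mean of columns A, B, C and D. In that case the target is an exact linear function of the inputs and a linear regression can recover it without error. The sketch below illustrates this on synthetic data (the values are randomly generated and are not the 'average' dataset): the fitted coefficients should come out close to 0.25 each and the intercept close to 0.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the dataset (assumption: AVERAGE = mean of A, B, C, D)
rng = np.random.default_rng(0)
features = ["A", "B", "C", "D"]
synthetic = pd.DataFrame(rng.integers(0, 1000, size=(500, 4)), columns=features)
synthetic["AVERAGE"] = synthetic[features].mean(axis=1)

# Fit a linear regression on the synthetic data
model = LinearRegression().fit(synthetic[features], synthetic["AVERAGE"])

# Inspect what the model learned
print("Coefficients:", model.coef_)
print("Intercept   :", model.intercept_)
```

Printing `regressor.coef_` and `regressor.intercept_` in the notebook would show whether the real dataset follows the same pattern.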


# Calculate the accuracy

In [5]:
# Calculate accuracy as the R² score (coefficient of determination) using 'r2_score'
accuracy = r2_score(y_test, y_pred)
print('Accuracy of your model :',accuracy)

Accuracy of your model : 1.0
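
The value reported as accuracy is the R² score (coefficient of determination), defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. A score of 1.0 means the predictions match the actual values exactly. The sketch below computes R² by hand and compares it with `r2_score`; the helper `r2_manual` and the sample values are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """Compute R² as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
    return 1.0 - ss_res / ss_tot


# Illustrative values only
y_true = [3.0, 5.0, 7.0, 9.0]
y_close = [2.5, 5.0, 7.5, 9.0]

print("Manual R² :", r2_manual(y_true, y_close))
print("sklearn R²:", r2_score(y_true, y_close))
print("Perfect predictions:", r2_manual(y_true, y_true))
```

Because R² measures explained variance rather than the share of correct predictions, it can be negative for a model that performs worse than always predicting the mean.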
