***Importing needed packages***

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

***Reading in the dataset***

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("../input/oc2emission/FuelConsumptionCo2.csv")
df

***Data Exploration***

In [None]:
df.head()

In [None]:
df.describe()

Selecting possible features

In [None]:
cdf = df[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB", "CO2EMISSIONS"]]
cdf

Visualizing the selected candidate features

In [None]:
viz = cdf
viz.hist()
plt.show()

Plotting different features against the label "CO2EMISSIONS" to see how linear their relationship is.

In [None]:
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color = "yellow")
plt.xlabel("ENGINESIZE")
plt.ylabel("CO2EMISSIONS")
plt.show()

In [None]:
plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color = "yellow")
plt.xlabel("CYLINDERS")
plt.ylabel("CO2EMISSIONS")
plt.show()

In [None]:
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color = "yellow")
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("CO2EMISSIONS")
plt.show()
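
The visual check in the scatter plots above can be complemented with a number. The following is a minimal sketch, not part of the original notebook, that ranks features by their Pearson correlation with the label. The helper name `correlation_with_label` and the synthetic `demo` frame are assumptions made for illustration; in the notebook the same call would be made on `cdf`.

```python
import numpy as np
import pandas as pd

def correlation_with_label(frame: pd.DataFrame, label: str) -> pd.Series:
    """Pearson correlation of every other numeric column with `label` (hypothetical helper)."""
    corr = frame.select_dtypes("number").corr()[label]
    # Drop the label's correlation with itself (always 1.0) and sort strongest first.
    return corr.drop(label).sort_values(ascending=False)

# Synthetic stand-in data (an assumption, NOT the real dataset):
# emissions grow roughly linearly with engine size, plus noise.
rng = np.random.default_rng(0)
engine = rng.uniform(1.0, 8.0, size=200)
demo = pd.DataFrame({
    "ENGINESIZE": engine,
    "CO2EMISSIONS": 40.0 * engine + 120.0 + rng.normal(0.0, 20.0, size=200),
})

demo_corr = correlation_with_label(demo, "CO2EMISSIONS")
print(demo_corr)
```

On the real data, `correlation_with_label(cdf, "CO2EMISSIONS")` would give one value per feature. A value close to 1 or -1 suggests a strongly linear relationship, although the scatter plots remain the better tool for spotting curvature or clusters.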

***Creating train and test dataset***

In [None]:
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
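
The mask above is drawn without a seed, so every run produces a different split and slightly different results. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for this tutorial. The function name `split_by_mask`, its parameters, and the small `demo` frame are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def split_by_mask(frame: pd.DataFrame, train_fraction: float = 0.8, seed: int = 42):
    """Split rows into train/test with a seeded random boolean mask (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Each row lands in the training set with probability `train_fraction`.
    mask = rng.random(len(frame)) < train_fraction
    return frame[mask], frame[~mask]

# Small stand-in frame (an assumption, not the real dataset).
demo = pd.DataFrame({"ENGINESIZE": np.linspace(1.0, 8.0, 100)})
demo_train, demo_test = split_by_mask(demo)
print(len(demo_train), len(demo_test))
```

Because the mask is random, the split is only approximately 80/20. Every row ends up in exactly one of the two sets.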

***Modeling***

In [None]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[["ENGINESIZE"]])
train_y = np.asanyarray(train[["CO2EMISSIONS"]])
regr.fit(train_x, train_y)
print("Coefficient: ", regr.coef_)
print("Intercept: ", regr.intercept_)

The coefficient and the intercept are the parameters of the fitted line. Because a simple linear regression has only these two parameters, the slope and the intercept of the line, sklearn can estimate them directly from the data using ordinary least squares.
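
To make "estimate them directly" concrete, here is a minimal sketch of the closed-form least-squares solution for a single feature: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The function name `fit_simple_line` and the tiny `demo_x`/`demo_y` arrays are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def fit_simple_line(x, y):
    """Closed-form least-squares slope and intercept for one feature (hypothetical helper)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Noise-free toy data lying exactly on y = 2 + 3x (an assumption for the demo).
demo_x = np.array([1.0, 2.0, 3.0, 4.0])
demo_y = 2.0 + 3.0 * demo_x
slope, intercept = fit_simple_line(demo_x, demo_y)
print(slope, intercept)
```

Called as `fit_simple_line(train_x, train_y)`, this should agree with `regr.coef_[0][0]` and `regr.intercept_[0]` up to floating-point error.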

***Train data distribution***

In [None]:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color = "blue")
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("ENGINESIZE")
plt.ylabel("CO2EMISSIONS")
plt.show()

***Plot outputs***

We can plot the fitted line over the data:

In [None]:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color = "yellow")
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], "-r")
plt.xlabel("ENGINESIZE")
plt.ylabel("CO2EMISSIONS")
plt.show()

***Evaluation***

In [None]:
from sklearn.metrics import r2_score
test_x = np.asanyarray(test[["ENGINESIZE"]])
test_y = np.asanyarray(test[["CO2EMISSIONS"]])
test_y_hat = regr.predict(test_x)

Mean absolute error:

In [None]:
np.mean(np.absolute(test_y_hat - test_y))

Mean squared error:

In [None]:
np.mean((test_y_hat - test_y)**2)

R-squared (Coefficient of determination):

In [None]:
r2_score(test_y, test_y_hat)  # r2_score expects (y_true, y_pred), in that order
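
For reference, the three metrics above can be written out by hand. This is a minimal sketch with a hypothetical helper `regression_metrics` and made-up demo values. It also shows why argument order matters for R-squared: the score is normalised by the spread of the true values, so swapping the arguments changes the result. scikit-learn's `r2_score` takes the true values first and the predictions second.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE and R-squared from scratch (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))      # mean absolute error
    mse = np.mean(residuals ** 2)         # mean squared error
    ss_res = np.sum(residuals ** 2)       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
    r2 = 1.0 - ss_res / ss_tot            # coefficient of determination
    return {"mae": mae, "mse": mse, "r2": r2}

# Made-up values for illustration only (not results from the dataset).
demo_true = np.array([200.0, 250.0, 300.0, 350.0])
demo_pred = np.array([210.0, 240.0, 310.0, 340.0])

demo_metrics = regression_metrics(demo_true, demo_pred)
demo_swapped = regression_metrics(demo_pred, demo_true)
print(demo_metrics)
print(demo_swapped["r2"])  # differs from demo_metrics["r2"]
```

MAE and MSE are symmetric in their arguments, so only R-squared is affected by the order.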