In [1]:
'''This project analyzes an advertising dataset to predict sales from TV and radio advertising budgets.
The data is loaded, the leftover index column is dropped, and the frame is checked for missing values and duplicate rows.
The data is then split into training and testing sets, and StandardScaler standardizes the features using statistics learned from the training set.
A LinearRegression model is fitted, and its coefficients, intercept, training score, and testing score (R^2) are reported.
The gap between the training and testing scores is used as a rough check for overfitting or underfitting.
Polynomial regression is then explored by expanding the features to degrees 2 through 6 and evaluating the model at each degree.
The aim is to find the model that predicts sales most accurately from the given advertising data.'''

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

import warnings
warnings.filterwarnings('ignore')

In [3]:
df=pd.read_csv("advertising.csv")

In [4]:
df.head()

   Unnamed: 0     TV  radio  newspaper  sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9


In [5]:
df.shape

(200, 5)

In [6]:
df.drop("Unnamed: 0",axis=1,inplace=True)
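An "Unnamed: 0" column usually appears when a CSV was saved with its row index and then read back without saying so. As a minimal sketch, assuming the file has that layout, the column can be absorbed at load time with `index_col=0` instead of being dropped afterwards. The three-row `csv_text` below is a hypothetical excerpt used only to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical excerpt laid out like advertising.csv is assumed to be:
# the first column has no header because it is a saved row index.
csv_text = """,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# Default read: pandas names the header-less column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index, so nothing needs dropping.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Either route leaves the same four data columns; the explicit `drop` used in this notebook has the advantage of being visible as a cleaning step.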

In [7]:
df.isnull().sum()

TV           0
radio        0
newspaper    0
sales        0
dtype: int64

In [8]:
df.dtypes

TV           float64
radio        float64
newspaper    float64
sales        float64
dtype: object

In [9]:
df.duplicated().sum()

0

In [10]:
X=df[["TV","radio"]] 
Y=df["sales"]
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3, random_state=1)

In [11]:
ss=StandardScaler()
X_train=ss.fit_transform(X_train) 
X_test=ss.transform(X_test) 
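The scaler is fitted on the training rows only and then reused on the test rows, so no test-set statistics leak into preprocessing. The sketch below illustrates that on synthetic data (the `X_demo` array is an assumption, not the advertising data): the scaled training set has mean 0 and standard deviation 1 per column, while the test set is shifted and scaled by the training statistics and so only comes close to that.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for two budget columns (not the real data).
X_demo = rng.uniform(0, 300, size=(200, 2))

X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=1)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # learns mean_ and scale_ from the training rows
X_te_s = scaler.transform(X_te)      # reuses those training statistics unchanged

print("train mean:", X_tr_s.mean(axis=0))
print("train std: ", X_tr_s.std(axis=0))
print("test mean: ", X_te_s.mean(axis=0))
```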

In [12]:
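def create_linearmodel(model):
    # Fit the model that was passed in, rather than relying on the global `lr`.
    model.fit(X_train,Y_train)
    # Test-set predictions; computed here but not used further in this notebook.
    Y_pred=model.predict(X_test)
    print('Co-eff: ',model.coef_)
    print('Intercept: ', model.intercept_)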
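`mean_squared_error` is imported at the top of the notebook but is not called in the cells shown. As a rough sketch of how test-set predictions could be scored with it, the example below fits a linear model on synthetic data (the data and the names `X_demo` and `y_demo` are assumptions for illustration) and reports MSE together with RMSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data (an assumption): a linear signal in two features plus unit-variance noise.
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 3.0 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

mse = mean_squared_error(y_te, y_pred)  # average squared error on the test rows
rmse = np.sqrt(mse)                     # back in the units of the target
print("MSE: ", mse)
print("RMSE:", rmse)
```

Unlike R^2, which is relative to the variance of the target, RMSE gives an absolute error size, so the two are often reported together.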

In [13]:
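def fitness(model):
    # For regressors, model.score returns the R^2 coefficient of determination.
    Train_score=model.score(X_train, Y_train)
    print('Train_score: ', Train_score)
    Test_score=model.score(X_test, Y_test)
    print('Test_score: ', Test_score)
    # Built-in round() works whether the scores are Python floats or NumPy floats.
    Difference = round(Train_score-Test_score, 4)
    print('Difference:', Difference)
    if Train_score>Test_score:
        if Difference>0.05:
            print('Overfit')
        else:
            print('Fine')
    else:
        # This label only means Test_score >= Train_score; on its own that
        # does not show that the model underfits.
        print('Underfit')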

In [14]:
lr = LinearRegression()
create_linearmodel(lr)
fitness(lr)

Co-eff:  [4.06756287 2.70136813]
Intercept:  13.79142857142857
Train_score:  0.8849581188519494
Test_score:  0.9230321850256801
Difference: -0.0381
Underfit
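The "Underfit" label above is printed only because the test score is higher than the training score. Underfitting is more commonly read as a model that scores poorly on both splits, and a test score slightly above the training score often just reflects the particular rows that landed in a single split. A minimal sketch of a labelling rule along those lines follows; `diagnose_fit` is a hypothetical helper, and the `gap` and `min_score` thresholds are assumptions rather than standard values.

```python
def diagnose_fit(train_score, test_score, gap=0.05, min_score=0.7):
    """Label a fit from train/test R^2 scores.

    Hypothetical helper: the thresholds are illustrative assumptions.
    """
    if train_score < min_score and test_score < min_score:
        return "Underfit"  # little variance explained on either split
    if train_score - test_score > gap:
        return "Overfit"   # much better on seen data than on unseen data
    return "Fine"


# Scores reported for the linear model in this notebook.
print(diagnose_fit(0.8849581188519494, 0.9230321850256801))
```

With these assumed thresholds, the linear model's scores are labelled "Fine" rather than "Underfit".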


In [15]:
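for i in range(2,7):
    print(i)
    # Expand TV and radio into all polynomial terms up to degree i.
    pf=PolynomialFeatures(i)
    X_poly=pf.fit_transform(X)
    X_train,X_test,Y_train,Y_test=train_test_split(X_poly,Y,test_size=0.3,random_state=1)
    # The bias column is constant, so scaling maps it to 0; this is why the
    # first coefficient printed for each degree is 0.
    X_train=ss.fit_transform(X_train)
    X_test=ss.transform(X_test)
    create_linearmodel(lr)
    fitness(lr)
    print('')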

2
Co-eff:  [ 0.          4.57360642  0.32267979 -3.00907384  3.69479751  0.18806517]
Intercept:  13.791428571428574
Train_score:  0.9830907214023425
Test_score:  0.9930704848288282
Difference: -0.01
Underfit

3
Co-eff:  [  0.           7.93573147   0.34266552 -12.01680893   4.4228605
  -0.12152205   5.95485265  -0.65814802  -0.17584246   0.32469645]
Intercept:  13.791428571428577
Train_score:  0.9896612551864068
Test_score:  0.9941836818138715
Difference: -0.0045
Underfit

4
Co-eff:  [  0.          12.2136094   -0.93748866 -31.59786383   5.10471306
   5.08427118  34.79759882  -0.87986617  -2.13185157  -6.13417063
 -13.75195341   0.72191463  -0.4644461    1.38037443   2.57346455]
Intercept:  13.79142857142858
Train_score:  0.9919751236711117
Test_score:  0.9957039491505716
Difference: -0.0037
Underfit

5
Co-eff:  [  0.          13.94112601  -6.25682255 -55.29302753  23.35529503
  26.3434794  105.9675238  -27.90591034 -40.24282092 -38.64902293
 -96.41855496  22.55235023  22.86629255  36.

In [16]:
pf=PolynomialFeatures(2)
X_poly=pf.fit_transform(X)
X_train,X_test,Y_train,Y_test=train_test_split(X_poly,Y,test_size=0.3,random_state=1)   
X_train=ss.fit_transform(X_train)
X_test=ss.transform(X_test)
create_linearmodel(lr)
fitness(lr)

Co-eff:  [ 0.          4.57360642  0.32267979 -3.00907384  3.69479751  0.18806517]
Intercept:  13.791428571428574
Train_score:  0.9830907214023425
Test_score:  0.9930704848288282
Difference: -0.01
Underfit
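The degree comparison above relies on a single 70/30 split, so small differences between degrees may partly reflect which rows fell into the test set. One possible alternative, sketched below under stated assumptions, is to wrap the three steps in a scikit-learn pipeline and compare degrees by cross-validated R^2. The data here is synthetic (`X_demo` and `y_demo` are stand-ins with an interaction effect, not the advertising data), and 5-fold cross-validation is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in (an assumption): two features with an interaction term plus noise.
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = (
    3.0
    + 0.05 * X_demo[:, 0]
    + 0.1 * X_demo[:, 1]
    + 0.002 * X_demo[:, 0] * X_demo[:, 1]
    + rng.normal(0, 1.0, size=200)
)

cv_scores = {}
for degree in range(1, 7):
    # The pipeline refits the scaler inside each fold, so held-out rows
    # never influence the scaling statistics.
    pipe = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[degree] = scores.mean()
    print(degree, round(scores.mean(), 4))

best_degree = max(cv_scores, key=cv_scores.get)
print("Best degree by mean CV R^2:", best_degree)
```

Averaging over folds gives a steadier estimate than one split, at the cost of fitting each candidate model several times.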
