## 1 - Set Up Environment

In [1]:
import numpy as np
import pandas as pd
import pickle

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2 - Load Dataframe

Now we load the dataframe that we saved in the previous section.

In [2]:
with open('/content/drive/MyDrive/Python/Regression/Assets/df(1.Final).pickle', 'rb') as file:
    df = pickle.load(file)

df.head(3)

Unnamed: 0,Year,Month,Week Day,Duration,Cost,Team Member,Height,Frequency,Signal Strength,Antenna Type,Orientation,Power Supply,Zone
0,2019,3,0,241.0,516773.0,12.0,24.0,Very Low Frequencies (VLF),3,Dielectric,Omni-directional,Solar-powered,North
1,2019,10,2,608.0,954888.0,22.0,42.0,Very Low Frequencies (VLF),3,Dielectric,Circular,Active,Center
2,2019,6,5,772.0,932640.0,14.0,43.0,Very High Frequencies (VHF),3,Printed Circuit Board (PCB),Horizontal,Active,Center


## 3 - Correlation Analysis

To analyse the correlation, we should consider the data types. The target variable, Cost, is numerical, while the other variables are either numerical or categorical. Based on the correlation rules:
*   Cost ~ Numerical variable : Pearson Correlation
*   Cost ~ Categorical variable : ANOVA Test

### 3.1 - Pearson Correlation

The numerical variables (excluding the target) are:
*   Duration
*   Team Member
*   Height

In [None]:
pearson_cor = df[['Cost' , 'Duration' , 'Team Member' , 'Height']].corr()

In [None]:
pearson_cor

Unnamed: 0,Cost,Duration,Team Member,Height
Cost,1.0,0.930659,-0.035774,-0.078127
Duration,0.930659,1.0,-0.029385,-0.069128
Team Member,-0.035774,-0.029385,1.0,0.021704
Height,-0.078127,-0.069128,0.021704,1.0


The dependent (target) variable is Cost, and the other variables are independent. So let's look at the correlation scores for Cost only.

In [None]:
# sort_values is a method, so it must be called with parentheses
pearson_cor['Cost'].sort_values(ascending=False)

Cost           1.000000
Duration       0.930659
Team Member   -0.035774
Height        -0.078127
Name: Cost, dtype: float64

We can interpret the table as follows:
*   There is a very strong positive relationship between Cost and Duration (r ≈ 0.93).
*   The relationships between Cost and Team Member (r ≈ -0.04) and between Cost and Height (r ≈ -0.08) are very weak and negative.
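
To make the meaning of these scores concrete, here is a minimal sketch of how the Pearson coefficient is computed, checked against `DataFrame.corr()`. The `toy` dataframe and its `x` and `y` columns are synthetic and used for illustration only; they are not part of the project's data.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the project's dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)
toy = pd.DataFrame({'x': x, 'y': y})

# Pearson r = sum of co-deviations / sqrt(product of squared deviations)
x_c = toy['x'] - toy['x'].mean()
y_c = toy['y'] - toy['y'].mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# pandas computes the same quantity (Pearson is the default method)
r_pandas = toy.corr().loc['x', 'y']
print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1; values near 0 indicate a weak linear relationship.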

### 3.2 - ANOVA Test

Now we use the ANOVA test to determine whether the mean cost differs significantly between the groups of each categorical variable. The categorical variables are:
*   Year
*   Month
*   Week Day
*   Frequency
*   Signal Strength
*   Antenna Type
*   Orientation
*   Power Supply
*   Zone


The ANOVA test compares the variance between the groups of a categorical variable with the variance within those groups. It uses a hypothesis test to do so, and it should be run for each categorical variable to see whether that variable is important for the project (here, for predicting cost).
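
As a rough illustration of what the test measures, the sketch below computes the one-way ANOVA F statistic by hand with pandas. The `toy` dataframe, its `group` and `value` columns, and the numbers in it are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the project's dataset)
toy = pd.DataFrame({
    'group': ['a'] * 4 + ['b'] * 4 + ['c'] * 4,
    'value': [10, 12, 11, 13, 20, 22, 21, 23, 30, 32, 31, 33],
})

grand_mean = toy['value'].mean()
per_group = toy.groupby('group')['value'].agg(['mean', 'count'])

# Between-group sum of squares: distance of each group mean from the grand mean
ss_between = (per_group['count'] * (per_group['mean'] - grand_mean) ** 2).sum()

# Within-group sum of squares: spread of observations around their own group mean
group_means = toy.groupby('group')['value'].transform('mean')
ss_within = ((toy['value'] - group_means) ** 2).sum()

df_between = len(per_group) - 1       # number of groups - 1
df_within = len(toy) - len(per_group)  # observations - number of groups

# F is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

A large F means the group means differ by much more than the scatter inside each group would explain, which is what leads to a small p-value.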

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

#### 3.2.1 - Year

In [None]:
model = ols('Cost ~ Year', data = df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
Year,21022940000000.0,1.0,84.384859,7.238285e-18
Residual,73992100000000.0,297.0,,


The p-value is about 7.24e-18, which is far less than 0.05. So, we reject the null hypothesis.
<br>
*   <i>Null hypothesis: Year has no significant influence on cost</i>
*   <u><i>Alternative hypothesis: Year has a significant influence on cost</i></u>


So, it is suggested to include the Year variable for predicting the cost. However, we'll analyse this further in the feature selection section.

#### 3.2.2 - Month

In [None]:
model = ols('Cost ~ Month', data = df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
Month,971631800000.0,1.0,3.068526,0.080854
Residual,94043410000000.0,297.0,,


The p-value is 0.080854, which is greater than 0.05. So, we fail to reject the null hypothesis.
<br>
*   <u><i>Null hypothesis: Month has no significant influence on cost</i></u>
*   <i>Alternative hypothesis: Month has a significant influence on cost</i>


So, it is suggested not to include the Month variable for predicting the cost. However, we'll analyse this further in the feature selection section.

#### 3.2.3 - Week Day

In [None]:
df.rename(columns={'Week Day': 'WeekDay'}, inplace=True)
model = ols('Cost ~ WeekDay', data=df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
WeekDay,13186600000.0,1.0,0.041225,0.839244
Residual,95001850000000.0,297.0,,


The p-value is 0.839244, which is greater than 0.05. So, we fail to reject the null hypothesis.
<br>
*   <u><i>Null hypothesis: Week Day has no significant influence on cost</i></u>
*   <i>Alternative hypothesis: Week Day has a significant influence on cost</i>


So, it is suggested not to include the Week Day variable for predicting the cost. However, we'll analyse this further in the feature selection section.

#### 3.2.4 - Frequency

In [None]:
model = ols('Cost ~ Frequency', data=df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
Frequency,4029957000000.0,7.0,1.841303,0.079245
Residual,90985080000000.0,291.0,,


The p-value is 0.079245, which is greater than 0.05. So, we fail to reject the null hypothesis.
<br>
*   <u><i>Null hypothesis: Frequency has no significant influence on cost</i></u>
*   <i>Alternative hypothesis: Frequency has a significant influence on cost</i>


So, it is suggested not to include the Frequency variable for predicting the cost. However, we'll analyse this further in the feature selection section.

#### 3.2.5 - Signal Strength

In [None]:
df.rename(columns={'Signal Strength': 'Signal_Strength'}, inplace=True)
model = ols('Cost ~ Signal_Strength', data=df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
Signal_Strength,791397000000.0,1.0,2.494543,0.115306
Residual,94223640000000.0,297.0,,


The p-value is 0.115306, which is greater than 0.05. So, we fail to reject the null hypothesis.
<br>
*   <u><i>Null hypothesis: Signal_Strength has no significant influence on cost</i></u>
*   <i>Alternative hypothesis: Signal_Strength has a significant influence on cost</i>


So, it is suggested not to include the Signal_Strength variable for predicting the cost. However, we'll analyse this further in the feature selection section.

#### 3.2.6 - Antenna Type

In [None]:
df.rename(columns={'Antenna Type': 'Antenna_Type'}, inplace=True)
model = ols('Cost ~ Antenna_Type', data=df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
Antenna_Type,617325600000.0,5.0,0.383222,0.860151
Residual,94397710000000.0,293.0,,


The p-value is 0.860151, which is greater than 0.05. So, we fail to reject the null hypothesis.
<br>
*   <u><i>Null hypothesis: Antenna_Type has no significant influence on cost</i></u>
*   <i>Alternative hypothesis: Antenna_Type has a significant influence on cost</i>


So, it is suggested not to include the Antenna_Type variable for predicting the cost. However, we'll analyse this further in the feature selection section.

#### 3.2.7 - Orientation

In [None]:
model = ols('Cost ~ Orientation', data=df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
Orientation,183091900000.0,3.0,0.189852,0.903271
Residual,94831950000000.0,295.0,,


The p-value is 0.903271, which is greater than 0.05. So, we fail to reject the null hypothesis.
<br>
*   <u><i>Null hypothesis: Orientation has no significant influence on cost</i></u>
*   <i>Alternative hypothesis: Orientation has a significant influence on cost</i>


So, it is suggested not to include the Orientation variable for predicting the cost. However, we'll analyse this further in the feature selection section.

#### 3.2.8 - Power Supply

In [None]:
df.rename(columns={'Power Supply': 'Power_Supply'}, inplace=True)
model = ols('Cost ~ Power_Supply', data=df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
Power_Supply,2409546000000.0,4.0,1.912431,0.108341
Residual,92605490000000.0,294.0,,


The p-value is 0.108341, which is greater than 0.05. So, we fail to reject the null hypothesis.
<br>
*   <u><i>Null hypothesis: Power_Supply has no significant influence on cost</i></u>
*   <i>Alternative hypothesis: Power_Supply has a significant influence on cost</i>


So, it is suggested not to include the Power_Supply variable for predicting the cost. However, we'll analyse this further in the feature selection section.

#### 3.2.9 - Zone

In [None]:
model = ols('Cost ~ Zone', data=df).fit()
sm.stats.anova_lm(model, typ = 2)

Unnamed: 0,sum_sq,df,F,PR(>F)
Zone,946596500000.0,4.0,0.739619,0.565607
Residual,94068440000000.0,294.0,,


The p-value is 0.565607, which is greater than 0.05. So, we fail to reject the null hypothesis.
<br>
*   <u><i>Null hypothesis: Zone has no significant influence on cost</i></u>
*   <i>Alternative hypothesis: Zone has a significant influence on cost</i>


So, it is suggested not to include the Zone variable for predicting the cost. However, we'll analyse this further in the feature selection section.