# Subject: Classical Data Analysis

## Session 1 - Regression

### Individual assignment 1

Develop a regression analysis in Statsmodels (with and without a constant) and SKLearn, based on the Iris dataset from sklearn. This dataset contains petal and sepal measurements (length and width) for three types of iris: Setosa, Versicolour, and Virginica.

See here for more information on this dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set 

Use the field “sepal width (cm)” as the independent variable and the field “sepal length (cm)” as the dependent variable.

- Interpret and discuss the OLS Regression Results.
- Commit your scripts to your GitHub account: export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.

The following are the tasks that you should complete and synchronize with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it is important that you understand and complete them correctly so that you do not have problems with later assignments.

# Linear Regression in Statsmodels

## Load the iris dataset

> Put your code here

In [3]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()

In [18]:
print(data.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [19]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [13]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [21]:
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
5                  5.4               3.9                1.7               0.4
6                  4.6               3.4                1.4               0.3
7                  5.0               3.4                1.5               0.2
8                  4.4               2.9                1.4               0.2
9                  4.9               3.1                1.5               0.1
10                 5.4               3.7                1.5               0.2
11                 4.8               3.4                1.6     

In [22]:
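# NOTE: data.target holds the class labels (0, 1, 2), not sepal length; for the assigned regression use df[["sepal length (cm)"]] instead.
target = pd.DataFrame(data.target, columns=["sepal length (cm)"])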

### Regression model with Statsmodels and without a constant:

> Put your code here

In [25]:
import statsmodels.api as sm

X = df["sepal width (cm)"]
y = target["sepal length (cm)"]

model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()

Dep. Variable:     sepal length (cm)   R-squared:            0.533
Model:             OLS                 Adj. R-squared:       0.529
Method:            Least Squares       F-statistic:          169.8
Date:              Fri, 13 Oct 2017    Prob (F-statistic):   2.19e-26
Time:              21:24:45            Log-Likelihood:       -194.11
No. Observations:  150                 AIC:                  390.2
Df Residuals:      149                 BIC:                  393.2
Df Model:          1
Covariance Type:   nonrobust

                   coef     std err  t       P>|t|   [0.025  0.975]
sepal width (cm)   0.3055   0.023    13.030  0.000   0.259   0.352

Omnibus:           408.899             Durbin-Watson:        0.04
Prob(Omnibus):     0.0                 Jarque-Bera (JB):     13.753
Skew:              -0.151              Prob(JB):             0.00103
Kurtosis:          1.548               Cond. No.             1.0


### Interpreting the Table 

> Answer question here

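> R-squared is 0.533, but because this model has no constant, Statsmodels reports the uncentered R-squared, so it cannot be read as "53.3% of the variance explained" or compared directly with the R-squared of the model that includes a constant.

> For every 1 cm increase in sepal width, the fitted value of `y` increases by 0.3055 on average, with the line forced through the origin.

> The p-value is reported as 0.000, the standard error is small (0.023), and the 95% confidence interval [0.259, 0.352] excludes zero, so the coefficient is statistically significant. The `[0.025` and `0.975]` columns are the lower and upper bounds of that 95% interval, not a 97.5% confidence level.

> Note that `target` was built from `data.target`, so `y` in this model holds the class labels (0, 1, 2) rather than the measured sepal length.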
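
One caveat when reading R-squared for a model without a constant: Statsmodels then reports the uncentered version, which compares the residuals with the raw sum of squares of `y` rather than with the variation of `y` around its mean. The sketch below uses a few made-up points (not the iris data) and a hand-computed through-origin fit to show how far apart the two definitions can be.

```python
# Toy data, invented for illustration only (not taken from the iris dataset).
x = [2.0, 3.0, 4.0, 5.0]
y = [5.1, 4.9, 5.2, 5.0]

# Least-squares slope for a line through the origin: b = sum(x*y) / sum(x*x).
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Residual sum of squares of the through-origin fit.
ssr = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

# Uncentered total sum of squares: distance of y from zero.
tss_uncentered = sum(yi ** 2 for yi in y)

# Centered total sum of squares: distance of y from its mean.
y_mean = sum(y) / len(y)
tss_centered = sum((yi - y_mean) ** 2 for yi in y)

r2_uncentered = 1 - ssr / tss_uncentered
r2_centered = 1 - ssr / tss_centered

print(f"slope          = {slope:.4f}")
print(f"uncentered R^2 = {r2_uncentered:.4f}")
print(f"centered R^2   = {r2_centered:.4f}")
```

In this toy case the uncentered value is high while the centered value is negative, because `y` sits far from zero but barely varies. A high R-squared in a no-constant model is therefore not evidence of a good fit on its own.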

### Regression model with Statsmodels and with a constant:

> Put your code here

In [26]:
X = df["sepal width (cm)"]
y = target["sepal length (cm)"]

X = sm.add_constant(X)
model = sm.OLS(y,X).fit()
predictions = model.predict()
model.summary()

Dep. Variable:     sepal length (cm)   R-squared:            0.176
Model:             OLS                 Adj. R-squared:       0.17
Method:            Least Squares       F-statistic:          31.6
Date:              Fri, 13 Oct 2017    Prob (F-statistic):   9.16e-08
Time:              21:26:33            Log-Likelihood:       -167.92
No. Observations:  150                 AIC:                  339.8
Df Residuals:      148                 BIC:                  345.9
Df Model:          1
Covariance Type:   nonrobust

                   coef     std err  t       P>|t|   [0.025  0.975]
const              3.4203   0.435    7.865   0.000   2.561   4.280
sepal width (cm)   -0.7925  0.141    -5.621  0.000   -1.071  -0.514

Omnibus:           38.964              Durbin-Watson:        0.272
Prob(Omnibus):     0.0                 Jarque-Bera (JB):     10.165
Skew:              0.329               Prob(JB):             0.0062
Kurtosis:          1.907               Cond. No.             24.3


### Interpreting the Table 

> Answer question here


> R-squared is only 17.6%, which is quite low: sepal width explains little of the variation in the dependent variable `y`, and the observations scatter widely around the fitted line.

> The slope is negative, so `y` tends to decrease as sepal width increases.

> For every 1 cm increase in sepal width, `y` is expected to decrease by 0.7925 on average; for every 1 cm decrease in sepal width, it is expected to increase by the same amount.

> The p-value for sepal width (0.000) and the 95% confidence interval [-1.071, -0.514], which excludes zero, indicate that the predictor (sepal width) is statistically significant even though the fit is weak.
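
The columns of the coefficient table are related by simple arithmetic, which can be checked from the reported numbers. The sketch below recomputes the t statistic and the 95% confidence interval for the slope from `coef` and `std err`. The critical value 1.976 is an assumed approximation of the two-sided 5% t quantile for 148 degrees of freedom, hard-coded to keep the example free of extra dependencies. It also squares the class correlation that `data.DESCR` lists for sepal width, since in a one-predictor model with a constant, R-squared equals the squared Pearson correlation between predictor and response.

```python
# Values copied from the OLS summary above (model with a constant).
coef = -0.7925
std_err = 0.141

# Assumed approximation of the t critical value for a two-sided 95%
# interval with 148 residual degrees of freedom.
t_crit = 1.976

# t statistic and confidence bounds for the slope.
t_stat = coef / std_err
ci_low = coef - t_crit * std_err
ci_high = coef + t_crit * std_err

# Class correlation for sepal width as listed in data.DESCR.
class_corr = -0.4194
r_squared = class_corr ** 2

print(f"t statistic: {t_stat:.2f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"squared correlation: {r_squared:.3f}")
```

The recomputed interval agrees with the `[0.025` and `0.975]` columns of the summary, and the squared class correlation agrees with the reported R-squared of 0.176. That match is consistent with `y` having been built from `data.target`, that is, from the class labels rather than from the measured sepal length.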

# Linear Regression in SKLearn 

> Put your code here

In [36]:
from sklearn import linear_model, datasets
import pandas as pd
import numpy as np

data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

In [41]:
df.tail()

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8


In [43]:
# Select the predictor and the response as single-column DataFrames.
df_x = df[["sepal width (cm)"]]
target_y = df[["sepal length (cm)"]]

In [45]:
X = df_x
y = target_y["sepal length (cm)"]

In [46]:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)
predictions = lm.predict(X)

In [48]:
lm.score(X,y)

0.011961632834767699

In [50]:
lm.coef_

array([-0.20887029])

In [51]:
lm.intercept_

6.4812232114596053
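
The fitted scikit-learn model is a straight line, so its predictions can be reproduced by hand from `lm.intercept_` and `lm.coef_`. A minimal sketch, using the values printed above; `predict_sepal_length` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
# Coefficients printed by the cells above.
INTERCEPT = 6.4812232114596053
SLOPE = -0.20887029


def predict_sepal_length(sepal_width_cm):
    """Return the predicted sepal length (cm) for a given sepal width (cm)."""
    return INTERCEPT + SLOPE * sepal_width_cm


# Evaluate the line at a few widths inside the observed range (2.0 to 4.4 cm).
for width in (2.0, 3.0, 4.0):
    length = predict_sepal_length(width)
    print(f"width {width:.1f} cm -> predicted length {length:.2f} cm")
```

`lm.score(X, y)` returns R-squared, and a value of about 0.012 means these predictions barely move away from the mean sepal length (5.84 cm according to `data.DESCR`). Sepal width alone is therefore a weak predictor of sepal length across the three species combined.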