### Open the CO2 notebook we have worked on 

#### The goal is to treat the month and year features as categorical variables for use in multiple regression
- Do you understand why it's problematic to leave them as integers?
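
One way to see the problem: as an integer, the month enters a regression as a single numeric predictor, so the model can only fit one slope across all twelve months. The sketch below uses a made-up seasonal signal (not the CO2 data) and compares a least-squares fit on the integer month with a fit on one indicator column per month. The names `toy`, `rmse_int`, and `rmse_cat` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic (made-up) seasonal signal: one value per month, repeated for 3 years.
months = np.tile(np.arange(1, 13), 3)
toy = pd.DataFrame({
    'Month': months,
    'y': 3.0 * np.sin(2 * np.pi * (months - 1) / 12),
})
y = toy['y'].to_numpy()

# Month as an integer: intercept plus a single slope shared by all months.
X_int = np.column_stack([np.ones(len(toy)), toy['Month'].to_numpy()])
coef_int, *_ = np.linalg.lstsq(X_int, y, rcond=None)
rmse_int = np.sqrt(np.mean((y - X_int @ coef_int) ** 2))

# Month as a category: one indicator column per month, so each month gets its own level.
X_cat = pd.get_dummies(toy['Month']).to_numpy(dtype=float)
coef_cat, *_ = np.linalg.lstsq(X_cat, y, rcond=None)
rmse_cat = np.sqrt(np.mean((y - X_cat @ coef_cat) ** 2))

print('RMSE, integer month:    ', rmse_int)
print('RMSE, categorical month:', rmse_cat)
```

The straight line cannot follow a cycle that rises and falls within the year, while the indicator columns can, which is the motivation for the dummy encoding in step 3.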

#### 1. Load the CO2 data

In [1]:
import numpy as np
import pandas as pd

url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt' #file link

cols = ['Year','Month','Date', 'average','interpolated','trend','days'] #col names

df = pd.read_csv(url, comment='#', sep=r'\s+', names=cols) #load file (sep=r'\s+' replaces the deprecated delim_whitespace=True)

df.head(5) #check

Unnamed: 0,Year,Month,Date,average,interpolated,trend,days
0,1958,3,1958.208,315.71,315.71,314.62,-1
1,1958,4,1958.292,317.45,317.45,315.29,-1
2,1958,5,1958.375,317.5,317.5,314.71,-1
3,1958,6,1958.458,-99.99,317.1,314.85,-1
4,1958,7,1958.542,315.86,315.86,314.98,-1


In [3]:
df.insert(len(df.columns),'Day',15) #insert col at the end

In [4]:
df.head()

Unnamed: 0,Year,Month,Date,average,interpolated,trend,days,Day
0,1958,3,1958.208,315.71,315.71,314.62,-1,15
1,1958,4,1958.292,317.45,317.45,315.29,-1,15
2,1958,5,1958.375,317.5,317.5,314.71,-1,15
3,1958,6,1958.458,-99.99,317.1,314.85,-1,15
4,1958,7,1958.542,315.86,315.86,314.98,-1,15


#### 2. Follow the steps from the previous notebook until the data has four columns (year, month, day, CO2)

In [5]:
cols_sub = ['Year','Month','Day','average']
df_CO2 = df[cols_sub].copy()
df_CO2.head() #check

Unnamed: 0,Year,Month,Day,average
0,1958,3,15,315.71
1,1958,4,15,317.45
2,1958,5,15,317.5
3,1958,6,15,-99.99
4,1958,7,15,315.86


In [6]:
df_CO2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735 entries, 0 to 734
Data columns (total 4 columns):
Year       735 non-null int64
Month      735 non-null int64
Day        735 non-null int64
average    735 non-null float64
dtypes: float64(1), int64(3)
memory usage: 23.0 KB
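
Note that `info()` reports 735 non-null values even though some rows hold -99.99. Assuming -99.99 is a placeholder for a missing monthly mean rather than a real measurement, it may be worth converting it to `NaN` before computing any statistics. A minimal sketch on a small stand-in frame (`sample` and `cleaned` are illustrative names, with values copied from the rows shown above):

```python
import numpy as np
import pandas as pd

# Small stand-in for df_CO2, using the first rows shown above.
sample = pd.DataFrame({
    'Year': [1958, 1958, 1958, 1958, 1958],
    'Month': [3, 4, 5, 6, 7],
    'Day': [15] * 5,
    'average': [315.71, 317.45, 317.50, -99.99, 315.86],
})

# Assumption: -99.99 marks a missing monthly mean, not a real measurement.
cleaned = sample.copy()
cleaned['average'] = cleaned['average'].replace(-99.99, np.nan)

print(cleaned['average'].isna().sum())  # number of months flagged as missing
print(sample['average'].mean(), cleaned['average'].mean())
```

pandas skips `NaN` by default in `mean()` and `std()`, so the placeholder no longer drags the statistics down.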


#### 3. Transform month and year to categorical variables 

`Use pd.get_dummies`

In [7]:
pd.get_dummies(df_CO2,columns=['Year','Month'])

Unnamed: 0,Day,average,Year_1958,Year_1959,Year_1960,Year_1961,Year_1962,Year_1963,Year_1964,Year_1965,...,Month_3,Month_4,Month_5,Month_6,Month_7,Month_8,Month_9,Month_10,Month_11,Month_12
0,15,315.71,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,15,317.45,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,15,317.50,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,15,-99.99,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,15,315.86,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
5,15,314.93,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
6,15,313.20,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,15,-99.99,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
8,15,313.33,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9,15,314.67,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
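
Two details may matter once these columns go into a regression. First, `pd.get_dummies` returns a new frame, so the result has to be assigned to be kept. Second, a full set of indicator columns sums to 1 in every row, which duplicates the intercept; passing `drop_first=True` drops one level per variable so that it acts as the baseline. A minimal sketch on a small stand-in frame (`sample`, `full`, and `reduced` are illustrative names):

```python
import pandas as pd

# Small stand-in for df_CO2 (values taken from the rows shown above).
sample = pd.DataFrame({
    'Year': [1958, 1958, 1958, 1958],
    'Month': [3, 4, 5, 7],
    'Day': [15, 15, 15, 15],
    'average': [315.71, 317.45, 317.50, 315.86],
})

# Keep the result: get_dummies returns a new frame and leaves the input unchanged.
full = pd.get_dummies(sample, columns=['Month'])

# drop_first=True omits the first level (Month_3 here), which becomes the baseline.
reduced = pd.get_dummies(sample, columns=['Month'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Whether to drop a level depends on the model: with an intercept and ordinary least squares it avoids perfectly collinear columns.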


#### 4. For practice purposes only, let's create a function that takes a df and a column name and returns a normalized column
- a column of our choice --> here we'll test the CO2 column
- use the mean and standard deviation (`np.mean` and `np.std`, or the equivalent pandas methods)
- scaling vs. standardization: https://www.statisticshowto.datasciencecentral.com/normalized/

In [9]:
df['average'].mean(), df['average'].std()

(350.22138775510206, 52.172570551608054)

In [11]:
def normalize_column(df, col_name):
    """Return the column standardized to zero mean and unit standard deviation (z-score)."""
    return (df[col_name] - df[col_name].mean()) / df[col_name].std()

In [12]:
normalize_column(df_CO2,'average')

0     -0.661485
1     -0.628134
2     -0.627176
3     -8.629274
4     -0.658610
5     -0.676436
6     -0.709595
7     -8.629274
8     -0.707103
9     -0.681419
10    -0.663210
11    -0.648643
12    -0.642318
13    -0.622959
14    -0.612034
15    -0.614717
16    -0.645577
17    -0.678927
18    -0.697328
19    -0.708445
20    -0.678927
21    -0.663977
22    -0.647685
23    -0.637335
24    -0.625643
25    -0.598042
26    -0.578683
27    -0.587117
28    -0.614142
29    -0.657652
         ...   
705    1.039408
706    1.072376
707    1.077934
708    1.092502
709    1.127386
710    1.139844
711    1.124319
712    1.090585
713    1.052442
714    1.018708
715    1.023691
716    1.052250
717    1.084643
718    1.106685
719    1.113585
720    1.134478
721    1.150386
722    1.169553
723    1.160928
724    1.121061
725    1.088093
726    1.059726
727    1.069118
728    1.107835
729    1.127961
730    1.161695
731    1.179329
732    1.183546
733    1.209421
734    1.235105
Name: average, Length: 735, dtype: float64
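
The function above subtracts the mean and divides by the standard deviation, which is usually called standardization (z-scoring); min-max scaling is the other common meaning of "normalize". The sketch below puts the two side by side on a few sample values and also shows that pandas and NumPy use different default denominators for the standard deviation. The helper names `standardize` and `min_max_scale` are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd

values = pd.Series([315.71, 317.45, 317.50, 315.86, 314.93], name='average')

def standardize(s, ddof=1):
    """Z-score: subtract the mean and divide by the standard deviation."""
    return (s - s.mean()) / s.std(ddof=ddof)

def min_max_scale(s):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    return (s - s.min()) / (s.max() - s.min())

z = standardize(values)
scaled = min_max_scale(values)

# pandas uses the sample standard deviation (ddof=1) by default,
# while np.std uses the population standard deviation (ddof=0).
print(values.std(), np.std(values.to_numpy()))
print(z.round(3).tolist())
print(scaled.round(3).tolist())
```

Because of the `ddof` difference, a version of the function built on `np.std` would give slightly different values from the pandas-based one unless `ddof=1` is passed explicitly.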

#### 5. Push all your changes to GitHub

#### 6. Open the suicide_rates file and notebook 

#### 7. Apply normalization and dummies wherever it seems appropriate to prepare the data for analysis. Discuss this in pairs or groups of three.

#### 8. Push any changes to GitHub

#### 9. Send your Coach the last value of normalized CO2 you obtained

