# A Machine Learning Model to Predict Unemployment Rate from Google Trends Data

This notebook examines the correlation between unemployment and Google Trends data in the United States. On the first Friday of each month, at 8:30 AM, the U.S. Bureau of Labor Statistics (BLS) releases the unemployment rate for the previous month. BLS uses telephone surveys during one week of each month (usually the week containing the 12th day) to compute the unemployment rate. This notebook develops a machine learning model that predicts the unemployment rate in the United States from Google Trends data. Such a tool could help banks and financial services companies estimate the unemployment rate before the federal government releases the official numbers.
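As a rough illustration of the schedule described above, the sketch below computes the first Friday of a month and the week containing the 12th. The helper names `first_friday` and `reference_week` are hypothetical, and the Sunday-to-Saturday week is an assumption made for this example; actual release dates can deviate from the first-Friday rule.

```python
from datetime import date, timedelta

def first_friday(year, month):
    """Return the first Friday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, Friday is 4
    return first + timedelta(days=(4 - first.weekday()) % 7)

def reference_week(year, month):
    """Return (Sunday, Saturday) of the week containing the 12th (assumed convention)."""
    twelfth = date(year, month, 12)
    # Days elapsed since the most recent Sunday (0 if the 12th is a Sunday)
    sunday = twelfth - timedelta(days=(twelfth.weekday() + 1) % 7)
    return sunday, sunday + timedelta(days=6)

release = first_friday(2020, 10)                   # nominal release of the September figure
survey_start, survey_end = reference_week(2020, 9)
print(release, survey_start, survey_end)
```

The gap between the survey week and the release date is the window in which a search-based estimate would be useful.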

## Initialization

This section imports the required packages and defines the parameters used throughout the project. The main data science tools are `NumPy`, `Pandas`, `Scikit-Learn`, and `SciPy`. The third-party `pytrends` package is used to download Google Trends data, and the unemployment rates are downloaded from the BLS time-series files at `download.bls.gov`.

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pytrends.request import TrendReq
from scipy.stats import pearsonr
from sklearn.preprocessing import MinMaxScaler
from matplotlib import rcParams
from pandas.plotting import register_matplotlib_converters
#from wordcloud import WordCloud
register_matplotlib_converters()
#rcParams['font.family'] = 'Arial'
%matplotlib notebook

#Dictionary for State abbreviations
states_dict = {'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona',
               'AR': 'Arkansas', 'CA': 'California', 'CO': 'Colorado',
               'CT': 'Connecticut', 'DE': 'Delaware',
               'DC': 'District of Columbia', 'FL': 'Florida', 'GA': 'Georgia',
               'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana',
               'IA': 'Iowa', 'KS': 'Kansas', 'KY': 'Kentucky', 
               'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland',
               'MA': 'Massachusetts', 'MI': 'Michigan', 'MN': 'Minnesota',
               'MS': 'Mississippi', 'MO': 'Missouri', 'MT': 'Montana',
               'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire',
               'NJ': 'New Jersey', 'NM': 'New Mexico', 'NY': 'New York',
               'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio',
               'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania',
               'RI': 'Rhode Island', 'SC': 'South Carolina',
               'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas',
               'UT': 'Utah', 'VT': 'Vermont', 'VA': 'Virginia',
               'WA': 'Washington', 'WV': 'West Virginia', 'WI': 'Wisconsin',
               'WY': 'Wyoming', 'US': 'United States'}

## Downloading Data
This section sets up functions to use the `pytrends` API along with the U.S. Department of Labor web API to download the data if the associated files are not already stored on disk.
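The download-only-if-missing idea can be sketched in isolation as follows. The helper `load_or_download` and the stand-in `fake_fetch` are hypothetical names used only for this example; unlike the notebook's functions, this sketch also reads the cached file back so that the caller always receives a DataFrame.

```python
import os
import tempfile
import pandas as pd

def load_or_download(path, fetch):
    """Return the cached CSV at `path`, or call `fetch()` once and cache its result."""
    if os.path.exists(path):
        return pd.read_csv(path, index_col=0, parse_dates=True)
    df = fetch()
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    df.to_csv(path)
    return df

calls = []

def fake_fetch():
    """Stand-in for a network download; records how often it is called."""
    calls.append(1)
    return pd.DataFrame({'value': [1.0, 2.0]},
                        index=pd.to_datetime(['2012-01-01', '2012-02-01']))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cache', 'demo.csv')
    first = load_or_download(path, fake_fetch)    # triggers the "download"
    second = load_or_download(path, fake_fetch)   # served from disk

print(len(calls))
```

Caching by file name keeps repeated runs of the notebook from hitting the remote services again for the same state, keyword, and date range.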

### Downloading Google Trends Data

In [3]:
def get_google_trends(kw_list, state, start_date, end_date):
    """Download monthly Google Trends interest for `kw_list` in `state` and cache it as CSV.

    Returns the monthly table after a download, or None when the file is already cached.
    """
    outname = '{}_{}_{}_{}.csv'.format(state, kw_list[0], start_date, end_date)
    outdir = './data/Google Trends'
    os.makedirs(outdir, exist_ok=True)

    fullname = os.path.join(outdir, outname)
    if not os.path.exists(fullname):
        pytrends = TrendReq(hl='en-US', tz=300)
        # 'US' is the national series; states use codes such as 'US-CA'
        geo = 'US' if state == 'US' else 'US-' + state

        pytrends.build_payload(kw_list, cat=0, timeframe=start_date+' '+end_date,
                               geo=geo, gprop='')
        df = pytrends.interest_over_time()
        df = df.reindex(pd.to_datetime(df.index))
        # Label each observation with the first day of its month, then sum within the month
        df['YearMonth'] = pd.to_datetime(df.index.strftime('%Y-%m'))
        table = pd.pivot_table(df, index='YearMonth', values=kw_list, aggfunc='sum')
        # Placeholder for offsetting the index; a zero offset leaves the dates unchanged
        table.index = table.index+pd.DateOffset(months=0, days=0)
        table.to_csv(fullname)
        return table

START_DATE = '2012-01-01'
END_DATE = '2020-09-30' 
KEYWORDS = ['bored', 'unemployment office', 'lunch']
states = list(states_dict.keys())

for STATE in states:         
    for KW_LIST in KEYWORDS:
        get_google_trends([KW_LIST], STATE, START_DATE, END_DATE)
        

### Downloading Unemployment Data

In [3]:
def get_states_links():
    df = pd.read_table('https://download.bls.gov/pub/time.series/la/la.txt', skiprows=104, nrows=52)
    i =1
    for col in df.columns:
        df.rename(columns={col:i}, inplace=True)
        i += 1
    states=df[3].to_list()
    for i in range(len(states)):
        states_abr = states[i].split(')')[-1].split('=')[0].strip()
        states[i] = states_abr
    df[3] = states
    df = df.set_index(3)
    df = df[2].str.strip()
    return df.to_dict()

def get_state_codes(): 
    df = pd.read_table('https://download.bls.gov/pub/time.series/la/la.txt', skiprows=104, nrows=52)
    i =1
    for col in df.columns:
        df.rename(columns={col:i}, inplace=True)
        i += 1
    states=df[3].to_list()
    for i in range(len(states)):
        states[i] = states[i].split(')')[-1].split('=')[0].strip()
    df['states'] = states    
    codes=df[3].to_list()
    for i in range(len(states)):
        codes[i] = codes[i].split(')')[0].split('=')[-1].strip()
    df['code'] = codes
    df.set_index('states', inplace=True)
    df = df['code']
    return df.to_dict()

def get_unemployment(state, start_date, end_date):
    outname = '{}_Unemployment_{}_{}.csv'.format(state, start_date, end_date)
    outdir = './data/Unemployment'
    fullname = os.path.join(outdir, outname)
    if os.path.exists(fullname):
        return None
    if not os.path.exists(outdir):
        os.makedirs(outdir)
    states_link = get_states_links()
    states_code = get_state_codes()
    if state == 'US':
        link = 'https://download.bls.gov/pub/time.series/ln/ln.data.1.AllData'
    else:
        link = 'https://download.bls.gov/pub/time.series/la/'+states_link[state]
    if state == 'US':
        mask = 'LNS14000000'
    else:
        mask = 'LASST'+str(states_code[state])+'00000000000'+'03'
    # sep=r'\s+' splits on runs of whitespace (replaces the deprecated delim_whitespace=True)
    df = pd.read_table(link, sep=r'\s+', low_memory=False)
    df = df[df['series_id']==mask]
    #offsetting
    df['YearMonth'] = (pd.to_datetime(df['year'].astype('str')+'-'+df['period'].str[1:])
                       +pd.DateOffset(months=0, days=0))
    df = df[(df['YearMonth']>=pd.to_datetime(start_date))&(df['YearMonth']<=pd.to_datetime(end_date))]
    df = df.set_index('YearMonth')
    df = pd.DataFrame(df['value'].astype('float64'))
    df.rename(columns={'value':'unemployment_rate'}, inplace=True)
    df.to_csv(fullname) 
    return df

states = list(states_dict.keys())

for STATE in states:
    get_unemployment(state=STATE, start_date=START_DATE, end_date=END_DATE)

## Exploratory Data Analysis

We'll begin by loading the unemployment data. Note that each timestamp in the `DataFrame.index` is the first day of a month and stands for that entire month.

In [4]:
path = 'data/Unemployment/{}_Unemployment_{}_{}.csv'.format(states[0], START_DATE, END_DATE)
df_unemp = pd.read_csv(path, index_col=0)
df_unemp.index = pd.to_datetime(df_unemp.index)
df_unemp
df_unemp.rename(columns={'unemployment_rate':states[0]}, inplace=True)

for STATE in states:
    path = 'data/Unemployment/{}_Unemployment_{}_{}.csv'.format(STATE, START_DATE, END_DATE)
    df_unemp[STATE] = pd.read_csv(path, index_col=0).values

print(f'\nRECORDS FROM {df_unemp.index[0].date()} TO {df_unemp.index[-1].date()}')
df_unemp.head()


RECORDS FROM 2012-01-01 TO 2020-09-01


YearMonth,AL,AK,AZ,AR,CA,CO,CT,DE,DC,FL,...,TN,TX,UT,VT,VA,WA,WV,WI,WY,US
2012-01-01,8.1,7.4,8.7,7.7,11.0,8.2,8.2,7.1,9.8,8.9,...,7.9,7.1,5.9,5.1,6.3,8.6,7.4,7.2,5.5,8.3
2012-02-01,7.9,7.3,8.7,7.6,10.9,8.1,8.2,7.1,9.6,8.8,...,7.8,7.0,5.8,5.0,6.2,8.5,7.3,7.1,5.4,8.3
2012-03-01,8.0,7.3,8.6,7.6,10.8,8.1,8.3,7.1,9.4,8.7,...,7.8,6.9,5.7,5.0,6.2,8.5,7.3,7.1,5.3,8.2
2012-04-01,8.1,7.2,8.6,7.6,10.7,8.1,8.3,7.2,9.3,8.7,...,7.9,6.9,5.6,5.0,6.2,8.4,7.4,7.1,5.3,8.2
2012-05-01,8.2,7.2,8.5,7.6,10.6,8.0,8.4,7.2,9.1,8.7,...,7.9,6.9,5.5,5.0,6.2,8.4,7.5,7.1,5.3,8.2
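Because every row is stamped with the first day of its month, the index can also be viewed as monthly periods. The following is a minimal sketch on a hand-written index rather than the notebook's data; it only illustrates the convention and is not part of the original workflow.

```python
import pandas as pd

# Month-start timestamps like the ones in the unemployment table
idx = pd.to_datetime(['2012-01-01', '2012-02-01', '2012-03-01'])

# The same dates viewed as whole-month periods
periods = idx.to_period('M')

# Each period knows the first and last day of the month it covers
print(periods[0], periods[0].start_time.date(), periods[0].end_time.date())
```

Keeping timestamps (as the notebook does) is convenient for plotting, while periods make the "whole month" meaning explicit.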


The next cell prints the first and last months of the records and shows the `head()` of the unemployment DataFrame.

In [5]:
print(f'\nRecords from {df_unemp.index[0].strftime("%B %Y")} to {df_unemp.index[-1].strftime("%B %Y")}')
df_unemp.head()


Records from January 2012 to September 2020


YearMonth,AL,AK,AZ,AR,CA,CO,CT,DE,DC,FL,...,TN,TX,UT,VT,VA,WA,WV,WI,WY,US
2012-01-01,8.1,7.4,8.7,7.7,11.0,8.2,8.2,7.1,9.8,8.9,...,7.9,7.1,5.9,5.1,6.3,8.6,7.4,7.2,5.5,8.3
2012-02-01,7.9,7.3,8.7,7.6,10.9,8.1,8.2,7.1,9.6,8.8,...,7.8,7.0,5.8,5.0,6.2,8.5,7.3,7.1,5.4,8.3
2012-03-01,8.0,7.3,8.6,7.6,10.8,8.1,8.3,7.1,9.4,8.7,...,7.8,6.9,5.7,5.0,6.2,8.5,7.3,7.1,5.3,8.2
2012-04-01,8.1,7.2,8.6,7.6,10.7,8.1,8.3,7.2,9.3,8.7,...,7.9,6.9,5.6,5.0,6.2,8.4,7.4,7.1,5.3,8.2
2012-05-01,8.2,7.2,8.5,7.6,10.6,8.0,8.4,7.2,9.1,8.7,...,7.9,6.9,5.5,5.0,6.2,8.4,7.5,7.1,5.3,8.2


The `describe()` method shows useful summary statistics for our DataFrame. These include `min`, `max`, `mean`, `std`, and the quartiles.

In [6]:
print(f'\nRecords from {df_unemp.index[0].strftime("%B %Y")} to {df_unemp.index[-1].strftime("%B %Y")}')
df_unemp.describe()


Records from January 2012 to September 2020


Statistic,AL,AK,AZ,AR,CA,CO,CT,DE,DC,FL,...,TN,TX,UT,VT,VA,WA,WV,WI,WY,US
count,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,...,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0
mean,5.728571,6.935238,6.229524,5.24,6.82,4.678095,5.907619,5.486667,7.026667,5.642857,...,5.607619,5.104762,3.846667,3.798095,4.554286,5.978095,6.388571,4.862857,4.593333,5.721905
std,1.900896,1.190829,1.622197,1.739452,2.798743,2.152592,1.758698,2.087189,1.386561,2.209501,...,2.138507,1.662793,1.177927,1.908129,1.482204,1.973672,1.589104,1.889152,1.010357,2.058658
min,2.7,5.2,4.5,3.5,3.9,2.5,3.4,3.5,5.1,2.8,...,3.3,3.4,2.4,2.3,2.6,3.8,4.7,3.0,3.4,3.5
25%,4.0,6.5,4.8,3.7,4.4,3.0,4.5,4.2,5.8,4.0,...,3.5,4.0,3.1,2.7,3.3,4.6,5.2,3.3,3.9,4.1
50%,6.0,6.9,5.9,4.5,5.8,3.6,5.6,4.7,6.5,5.1,...,5.0,4.6,3.6,3.4,4.2,5.6,6.4,4.3,4.3,5.0
75%,7.2,7.0,7.5,7.1,8.7,6.4,7.7,6.4,8.4,6.9,...,7.4,6.1,4.3,4.3,5.7,6.9,6.9,6.3,5.1,7.2
max,13.8,13.5,13.4,10.8,16.4,12.2,10.2,15.9,11.7,13.8,...,15.5,13.5,10.4,16.5,11.2,16.3,15.9,13.6,9.6,14.7


### States with Highest and Lowest Unemployment

The cell below creates a table of the states with the highest and lowest mean annual unemployment rate for each year of the records. Nevada (2012 to 2014 and 2020) and Alaska (2016 to 2019) most often have the highest rate, while North Dakota has the lowest rate in five of the nine years.

In [7]:
table_unemp = pd.pivot_table(df_unemp, index=df_unemp.index.year, aggfunc=np.mean).T
pd.DataFrame(data={'Highest':[table_unemp[year].idxmax() for year in table_unemp.columns],
                   'Lowest':[table_unemp[year].idxmin() for year in table_unemp.columns]},
             index=[year for year in table_unemp.columns])

Year,Highest,Lowest
2012,NV,ND
2013,NV,ND
2014,NV,ND
2015,DC,ND
2016,AK,NH
2017,AK,HI
2018,AK,HI
2019,AK,ND
2020,NV,NE


### Boxplot for US States
The code below creates a boxplot of unemployment in all states and the District of Columbia for each year, with whiskers spanning the full range (`whis=[0,100]`). A scatter plot of the national rate is superimposed on the box plot. Years with high national unemployment tend to show a wider spread between the states.

In [8]:
%matplotlib notebook
table_unemp = pd.pivot_table(df_unemp, index=df_unemp.index.year, aggfunc=np.mean).T

us_avg = table_unemp[table_unemp.index=='US']
#taking out the national unemployment rate
table_unemp = table_unemp[table_unemp.index!='US']

plt.figure();
table_unemp.boxplot(grid=False, whis=[0,100]);

plt.ylabel('Unemployment Rate, %');
plt.title('Unemployment Rate in US States');
ax = plt.gca()

plt.scatter(ax.get_xticks(), us_avg.values.reshape(-1), 
            label='National Rate', c='#e04e14', alpha=.6, s=6);
plt.legend();


In [9]:
path = 'data/Google Trends/{}_{}_{}_{}.csv'.format(states[0],KEYWORDS[0], START_DATE, END_DATE)

multicol = pd.MultiIndex.from_product([KEYWORDS, states], names=['keywords', 'states'])
index = pd.to_datetime(pd.read_csv(path)['YearMonth'])
df_google = pd.DataFrame(index=index, columns=multicol)

for keyword in KEYWORDS:
    for state in states:
        path = 'data/Google Trends/{}_{}_{}_{}.csv'.format(state,keyword, START_DATE, END_DATE)
        df_google[(keyword, state)] = pd.read_csv(path).iloc[:,-1].values     
        
df_google.head()

keywords,bored,bored,bored,bored,bored,bored,bored,bored,bored,bored,...,lunch,lunch,lunch,lunch,lunch,lunch,lunch,lunch,lunch,lunch
states,AL,AK,AZ,AR,CA,CO,CT,DE,DC,FL,...,TN,TX,UT,VT,VA,WA,WV,WI,WY,US
YearMonth
2012-01-01,76,73,83,100,58,74,88,72,64,74,...,38,37,43,39,44,39,31,48,33,41
2012-02-01,72,41,86,82,51,54,61,68,47,79,...,38,33,38,29,47,46,35,43,48,41
2012-03-01,63,80,71,95,51,50,64,48,35,75,...,34,31,44,28,40,44,25,39,34,38
2012-04-01,70,44,50,84,55,47,67,69,38,68,...,30,31,33,32,40,44,37,37,39,39
2012-05-01,94,92,78,94,57,70,67,30,56,82,...,34,35,44,34,42,47,35,41,23,40
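The Google Trends table uses two column levels, keyword and state. The sketch below builds a much smaller frame with the same layout to show how such columns can be selected; the two keywords and two states are a reduced stand-in, and only the three `lunch`/`US` values are taken from the table above.

```python
import pandas as pd

# Two-level columns: every keyword is paired with every state
cols = pd.MultiIndex.from_product([['bored', 'lunch'], ['CA', 'US']],
                                  names=['keywords', 'states'])
idx = pd.date_range('2012-01-01', periods=3, freq='MS')
df = pd.DataFrame(index=idx, columns=cols, dtype=float)

# A (keyword, state) tuple addresses exactly one column
df[('lunch', 'US')] = [41.0, 41.0, 38.0]

# Selecting on the first level returns all states for one keyword
lunch_all_states = df['lunch']

# xs() slices on the second level: all keywords for one state
national = df.xs('US', axis=1, level='states')
print(national)
```

This is why later cells can write `df_google[(keyword, 'US')]` to pull a single national series.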


### Visualization of Google Trends Data

In [10]:
pd.date_range(start='2012-08-01', end='2019-08-01', freq='12M')

DatetimeIndex(['2012-08-31', '2013-08-31', '2014-08-31', '2015-08-31',
               '2016-08-31', '2017-08-31', '2018-08-31'],
              dtype='datetime64[ns]', freq='12M')

In [10]:
%matplotlib notebook
kw = 'lunch'
plt.plot(df_google[kw,'US'].index, df_google[kw,'US'].values, lw=0.8, c ='#659c75', zorder=0)
aug = pd.date_range(start='2012-08-01', end='2019-08-01', freq='12MS')
plt.scatter(aug, df_google[kw,'US'].loc[aug], s=13, c='#ed4a4a', marker='^', zorder=1, label='August')
dec = pd.date_range(start='2012-12-01', end='2019-12-01', freq='12MS')
plt.scatter(dec, df_google[kw,'US'].loc[dec], s=13, c='#1f54cf', marker='v',zorder=1, label='December')
plt.legend()
plt.title(f'Searches for "{kw}" in the United States')
plt.tight_layout()


In [11]:
kw = 'unemployment office'
temp = pd.DataFrame(df_google[kw,'US'])
temp = temp.groupby(by=lambda x: x.year).agg([lambda x: x.idxmax().month_name(), 
                                             lambda x: x.idxmin().month_name()])
temp.columns = ['Highest','Lowest']
temp.style.set_caption(f'Months with Highest and Lowest searches for "{kw}"')

Year,Highest,Lowest
2012,January,March
2013,January,November
2014,January,May
2015,January,May
2016,January,May
2017,January,March
2018,January,September
2019,January,March
2020,April,February


In [12]:
kw = 'bored'
temp = pd.DataFrame(df_google[kw,'US'])
temp = temp.groupby(by=lambda x: x.year).agg([lambda x: x.idxmax().month_name(), 
                                             lambda x: x.idxmin().month_name()])
temp.columns = ['Highest','Lowest']
temp.style.set_caption(f'Months with Highest and Lowest searches for "{kw}"')

Year,Highest,Lowest
2012,June,October
2013,June,September
2014,June,September
2015,June,September
2016,June,September
2017,April,August
2018,January,August
2019,February,August
2020,April,August


### Plotting Correlation

In [13]:
goog_normalize = ((df_google - df_google.min())/
                  (df_google.max()-df_google.min()))

unemp_normalize = ((df_unemp - df_unemp.min())/
                  (df_unemp.max() - df_unemp.min()))
    
fig, ax = plt.subplots(nrows=1, ncols=3, sharex='all', sharey='all', figsize=(10.2,4));


for AX, keyword in zip(ax, KEYWORDS):
    
    pr = pearsonr(unemp_normalize['US'].values[1:], goog_normalize[(keyword, 'US')].shift().values[1:])[0]
    pr = round(pr,2)
    AX.scatter(unemp_normalize.index, unemp_normalize['US'].values, s=4, label='Unemployment Rate', alpha=.6)
    AX.scatter(goog_normalize.index, goog_normalize[(keyword, 'US')].shift().values, s=4,
               label='Google Searches', alpha=.6)
    AX.legend(fontsize=8)
    AX.tick_params(axis = 'x', labelrotation=45, labelsize=9)
    AX.tick_params(axis = 'y', labelsize=9)
    AX.text(0.05,0.82, f"Pearson's r: {pr}", size=8, transform=AX.transAxes, ha='left')
    AX.set_title(f'"{keyword}"', size=10)
    
    
ax[0].set_ylabel('Normalized Statistics', size=9);



fig.tight_layout()

plt.savefig('corr.png', dpi=400)


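The correlation cell rescales both series to the range 0 to 1 and shifts the search series by one month before comparing it with the unemployment rate. The sketch below reproduces that idea on made-up numbers, where the rate is constructed to follow the previous month's searches exactly. It uses pandas' `Series.corr` (Pearson by default) in place of `scipy.stats.pearsonr`, and `min_max` is a hypothetical helper.

```python
import pandas as pd

idx = pd.date_range('2012-01-01', periods=6, freq='MS')
# Synthetic search interest (made-up values)
searches = pd.Series([10.0, 20.0, 40.0, 30.0, 50.0, 60.0], index=idx)
# Synthetic rate built to follow the previous month's searches exactly
rate = searches.shift() / 10

def min_max(s):
    """Rescale a Series to the [0, 1] range, ignoring NaN."""
    return (s - s.min()) / (s.max() - s.min())

# Series.corr drops the NaN pair created by shift() before computing Pearson's r
same_month = min_max(rate).corr(min_max(searches))
one_month_lag = min_max(rate).corr(min_max(searches).shift())
print(round(same_month, 2), round(one_month_lag, 2))
```

Min-max scaling is an affine transformation, so it changes how the two series overlay on a shared axis but leaves Pearson's r unchanged; the lag is what moves the coefficient.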

## Initializing the DataFrames for sklearn

In [102]:
lc = 'CA'

In [103]:
X = pd.DataFrame(index = df_google.index)


#Adding Last Month's unemployment data
X['Unemp_past_mo'] = df_unemp[lc].shift()

#Adding Google Trends data
for keyword in KEYWORDS:
    X[keyword] = df_google[(keyword), lc]
    X[f'{keyword}_past_mo'] = df_google[(keyword), lc].shift()
    X[f'{keyword}_2mo_ave'] = ((X[f'{keyword}_past_mo'] + X[keyword])/2)

#Adding the month as categorical data
X['Month'] = X.index.month

X = X.iloc[1:]
X.head()

YearMonth,Unemp_past_mo,bored,bored_past_mo,bored_2mo_ave,unemployment office,unemployment office_past_mo,unemployment office_2mo_ave,lunch,lunch_past_mo,lunch_2mo_ave,Month
2012-02-01,11.0,53,55.0,54.0,36,53.0,44.5,42,42.0,42.0,2
2012-03-01,10.9,54,53.0,53.5,37,36.0,36.5,42,42.0,42.0,3
2012-04-01,10.8,57,54.0,55.5,46,37.0,41.5,43,42.0,42.5,4
2012-05-01,10.7,58,57.0,57.5,38,46.0,42.0,44,43.0,43.5,5
2012-06-01,10.6,68,58.0,63.0,42,38.0,40.0,43,44.0,43.5,6


In [104]:
y = df_unemp[lc].iloc[1:]


In [105]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion

numerical_columns = [0,1,2,3,4,5,6,7,8,9]
categorical_columns = [10]

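# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases need
# OneHotEncoder(sparse_output=False) here.
one_hot_transformer = ColumnTransformer([('Categorical', 
                                          OneHotEncoder(categories='auto', sparse=False),categorical_columns)])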

scale_and_poly = Pipeline([('Scaler', StandardScaler()),
                           ('Poly',PolynomialFeatures(degree=1))])

polynomial_transformer = ColumnTransformer([('Numerical', scale_and_poly, numerical_columns)])

trans_union = FeatureUnion([('OneHot',one_hot_transformer),
              ('scale_poly',polynomial_transformer)]);
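To see what the combined transformer produces, the sketch below applies the same kind of preprocessing to a tiny made-up table. It is a simplified stand-in, not the notebook's exact object: it uses a single `ColumnTransformer` instead of a `FeatureUnion` of two, and it relies on `sparse_threshold=0` to get a dense array so that the example does not depend on the encoder's `sparse`/`sparse_output` argument.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Toy feature table (made-up numbers): two numeric columns plus a month column
toy = pd.DataFrame({'rate_past_mo': [5.0, 5.2, 5.1, 5.5],
                    'searches': [40.0, 42.0, 39.0, 60.0],
                    'Month': [1, 2, 3, 1]})

# Numeric columns are standardized; degree=1 only prepends a constant bias column
numeric = Pipeline([('Scaler', StandardScaler()),
                    ('Poly', PolynomialFeatures(degree=1))])

# sparse_threshold=0 forces a dense array regardless of the encoder's output type
transformer = ColumnTransformer([('OneHot', OneHotEncoder(), [2]),
                                 ('Numerical', numeric, [0, 1])],
                                sparse_threshold=0)

features = transformer.fit_transform(toy.values)
print(features.shape)
```

With three distinct months in the toy data, the output has three one-hot columns, one bias column, and two scaled numeric columns.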

In [106]:
CV = [(np.arange(0,len(X)+i),np.array([len(X)+i])) for i in range(-24, -4)]
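The `CV` list above defines an expanding-window (walk-forward) scheme: each fold trains on every month before a cut-off and tests on the single month that follows. The sketch below wraps the same list comprehension in a hypothetical helper, `expanding_window_splits`, and uses an illustrative sample count of 104 to make the resulting indices visible.

```python
import numpy as np

def expanding_window_splits(n_samples, first_offset, last_offset):
    """Return (train_idx, test_idx) pairs, each holding out one month."""
    return [(np.arange(0, n_samples + i), np.array([n_samples + i]))
            for i in range(first_offset, last_offset)]

# 104 rows is an assumed length for illustration only
splits = expanding_window_splits(n_samples=104, first_offset=-24, last_offset=-4)

for train_idx, test_idx in (splits[0], splits[-1]):
    # The training window always ends right before the held-out month
    print(f'train 0..{train_idx[-1]}, test {test_idx[0]}')
```

Because the training window never extends past the test month, no future information leaks into the fit. Stopping the range at -4 also means the final four rows are never used as test points.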

In [107]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import make_scorer

pipe = Pipeline([('Transform', trans_union),
                 ('Model', Ridge())])

gs = GridSearchCV(pipe, param_grid={'Model__alpha':np.arange(1,30,1)}, cv=CV, scoring='neg_mean_squared_error')
gs.fit(X, y)
print("The best parameter is:", gs.best_params_)
print("The best score is:",gs.best_score_)

The best parameter is: {'Model__alpha': 29}
The best score is: -1.587525735845257
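The reported score is a negated mean squared error, so it helps to convert it into a root mean squared error in the units of the unemployment rate. The snippet below is only the arithmetic on the number printed above.

```python
import math

best_score = -1.587525735845257   # value printed by the grid search above
mse = -best_score                 # 'neg_mean_squared_error' is the negated MSE
rmse = math.sqrt(mse)             # back in percentage points of unemployment
print(f'MSE = {mse:.2f}, RMSE = {rmse:.2f} percentage points')
```

Since every fold holds out a single month, the average of the per-fold MSE values equals the mean of the squared one-step errors across all held-out months.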


In [108]:
import seaborn as sns
import matplotlib.dates as mdates
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

model = gs.best_estimator_

y_pred = []
y_true = []
x_axis = []

for train_idx, test_idx in CV:
    X_train, y_train = X.values[train_idx], y.values[train_idx]
    X_test, y_test = X.values[test_idx], y.values[test_idx]
    model.fit(X_train, y_train)
    y_true.append(y_test)
    y_pred.append(model.predict(X_test))
    x_axis.append(y.index[test_idx])
    
plt.figure()
plt.scatter(x_axis, y_pred, label='Predicted', s=20,zorder=100, facecolor='None', edgecolor='tab:blue')
plt.scatter(x_axis, y_true, label='Actual', s=20,zorder=100, facecolor='None', edgecolor='tab:orange')
plt.ylabel('Unemployment Rate (%)')
myFmt = mdates.DateFormatter('%Y\n%b')
ax = plt.gca()
ax.xaxis.set_major_formatter(myFmt)
plt.legend()
plt.title(states_dict[lc])

r2 = round(r2_score(y_true=y_true, y_pred=y_pred),2)
mse = round(mean_squared_error(y_true=y_true, y_pred=y_pred),2)

plt.text(0.03,0.80, r'$R^{2}$ score: '+f'{r2}', transform=ax.transAxes, ha='left')
plt.text(0.03,0.75, 'MSE: '+f'{mse}', transform=ax.transAxes, ha='left')
plt.tight_layout()
plt.savefig(f'{lc}_model.png',dpi=400)



In [109]:
from sklearn import base

class DummyEstimator(base.BaseEstimator, base.RegressorMixin):
    
    def __init__(self, past_month_column=0):
        self.past_month_column = past_month_column
    
    def fit(self, X, y=None):
        return self
    
    def predict(self, X):
        return np.array(X)[:, self.past_month_column]
    
    def score(self, X, y):
        return r2_score(y_true=y, y_pred=self.predict(X))
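`DummyEstimator` acts as a persistence baseline: it ignores the training data and returns last month's unemployment rate as the prediction. The sketch below shows the same idea with plain NumPy on made-up rates, without the scikit-learn wrapper.

```python
import numpy as np

# Made-up monthly unemployment rates, in percent
rates = np.array([5.0, 5.2, 5.1, 5.5])

# Persistence forecast: the prediction for month t is the observed rate of month t-1
predicted = rates[:-1]
actual = rates[1:]

mse = float(np.mean((actual - predicted) ** 2))
print(f'Persistence baseline MSE: {mse:.3f}')
```

A model built on search data is only useful if it beats this baseline on the same held-out months.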

In [110]:
import seaborn as sns
import matplotlib.dates as mdates
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

model = DummyEstimator(past_month_column=X.columns.get_loc('Unemp_past_mo'))

y_pred = []
y_true = []
x_axis = []

for train_idx, test_idx in CV:
    X_train, y_train = X.values[train_idx], y.values[train_idx]
    X_test, y_test = X.values[test_idx], y.values[test_idx]
    model.fit(X_train, y_train)
    y_true.append(y_test)
    y_pred.append(model.predict(X_test))
    x_axis.append(y.index[test_idx])
    
plt.figure()
plt.scatter(x_axis, y_pred, label='Predicted', s=20,zorder=100, facecolor='None', edgecolor='tab:blue')
plt.scatter(x_axis, y_true, label='Actual', s=20,zorder=100, facecolor='None', edgecolor='tab:orange')
plt.ylabel('Unemployment Rate (%)')
myFmt = mdates.DateFormatter('%Y\n%b')
ax = plt.gca()
ax.xaxis.set_major_formatter(myFmt)
plt.legend()
plt.title(states_dict[lc])

r2 = round(r2_score(y_true=y_true, y_pred=y_pred),2)
mse = round(mean_squared_error(y_true=y_true, y_pred=y_pred),2)

plt.text(0.03,0.80, r'$R^{2}$ score: '+f'{r2}', transform=ax.transAxes, ha='left')
plt.text(0.03,0.75, 'MSE: '+f'{mse}', transform=ax.transAxes, ha='left')
plt.text(0.5,0.5, 'Dummy\nEstimator', transform=ax.transAxes, 
         alpha=.2, ha='center', va='center', size=30)
plt.tight_layout()
plt.savefig(f'{lc}_dummy.png',dpi=400)


In [111]:
X

YearMonth,Unemp_past_mo,bored,bored_past_mo,bored_2mo_ave,unemployment office,unemployment office_past_mo,unemployment office_2mo_ave,lunch,lunch_past_mo,lunch_2mo_ave,Month
2012-02-01,11.0,53,55.0,54.0,36,53.0,44.5,42,42.0,42.0,2
2012-03-01,10.9,54,53.0,53.5,37,36.0,36.5,42,42.0,42.0,3
2012-04-01,10.8,57,54.0,55.5,46,37.0,41.5,43,42.0,42.5,4
2012-05-01,10.7,58,57.0,57.5,38,46.0,42.0,44,43.0,43.5,5
2012-06-01,10.6,68,58.0,63.0,42,38.0,40.0,43,44.0,43.5,6
...,...,...,...,...,...,...,...,...,...,...,...
2020-04-01,5.5,100,87.0,93.5,100,85.0,92.5,42,59.0,50.5,4
2020-05-01,16.4,73,100.0,86.5,63,100.0,81.5,49,42.0,45.5,5
2020-06-01,16.4,61,73.0,67.0,67,63.0,65.0,51,49.0,50.0,6
2020-07-01,14.9,53,61.0,57.0,53,67.0,60.0,52,51.0,51.5,7


In [36]:
df_unemp['MA']

YearMonth
2012-01-01     6.7
2012-02-01     6.7
2012-03-01     6.6
2012-04-01     6.6
2012-05-01     6.6
              ... 
2020-04-01    16.2
2020-05-01    16.6
2020-06-01    17.7
2020-07-01    16.2
2020-08-01    11.3
Name: MA, Length: 104, dtype: float64