# Python for Data Science Project Session 4: Economics and Finance

This dataset contains the hourly and daily counts of rental bikes in 2011 and 2012 in the Capital bike-share system, together with the corresponding weather and seasonal information. You can find more information about the dataset [here](https://archive-beta.ics.uci.edu/ml/datasets/bike+sharing+dataset). This notebook covers data transformations, pivot tables and simple regression.

## Analysing the dataset

First, let's import Pandas and NumPy.

In [1]:
import pandas as pd
import numpy as np

Now, we need to load the data (use `pandas.read_csv()`; the dataset is named `day`) and save it as `df`.

In [2]:
df = pd.read_csv('dataset/day.csv')

Display the dataframe and use `.describe()` to check whether your dataset has any missing values (compare the `count` row with the number of rows).

In [3]:
df

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.200000,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.229270,0.436957,0.186900,82,1518,1600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,727,2012-12-27,1,1,12,0,4,1,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1,1,12,0,5,1,2,0.253333,0.255046,0.590000,0.155471,644,2451,3095
728,729,2012-12-29,1,1,12,0,6,0,2,0.253333,0.242400,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1,1,12,0,0,0,1,0.255833,0.231700,0.483333,0.350754,364,1432,1796


In [4]:
df.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


We can see that our dataset has no missing values. Now, let's drop columns that we won't use (`casual`, `registered`).

In [5]:
df = df.drop(['casual','registered'],axis=1)
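The missing-value check above relies on the `count` row of `.describe()`, which only covers numeric columns. A more direct check is `isna().sum()`. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the real data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bike-share data (values are made up).
toy = pd.DataFrame({
    "temp": [0.34, np.nan, 0.20],
    "cnt": [985, 801, 1349],
    "dteday": ["2011-01-01", "2011-01-02", None],
})

# Number of missing values per column, including non-numeric columns.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if the frame contains at least one missing value anywhere.
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

On the real data, `df.isna().sum()` should report zero for every column if nothing is missing.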

As we can see, our weekdays are stored as numbers. To make them more intuitive, we want to show the name of the day that corresponds to each number. To do this, create a dataframe that contains the number (`no`) and the corresponding day (`day`; "Mon", "Tue", "Wed", ...). Call it `weekdays`. (The labels in this exercise simply start at 0 = "Mon"; since 2011-01-01, a Saturday, is coded as 6, the dataset itself appears to count from Sunday.)

In [6]:
weekdays = pd.DataFrame(data={'no': [0,1,2,3,4,5,6], 'day': ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']})

The last piece of data that we will use is the shift data. `shift.csv` contains the date and the name of the employee who was on shift that day (let's say they work at the helpdesk). We need to load it (call the dataframe `shift`, with the first column named `date` and the second `employee`) and display it.

In [7]:
shift = pd.read_csv('dataset/shift.csv',names=['date','employee'])

In [8]:
shift

Unnamed: 0,date,employee
0,01/01/2011,Kate
1,02/01/2011,John
2,03/01/2011,Harry
3,04/01/2011,Harry
4,05/01/2011,John
...,...,...
726,27/12/2012,John
727,28/12/2012,Kate
728,29/12/2012,Kate
729,30/12/2012,Kate


We have all the data that we need!

We would like to combine `weekdays` with `df` on the number of the day. As we can see, in `df` the number of the day is called `weekday`, and in `weekdays` it is called `no`. Therefore, we need to change the name of one of the columns. Let's rename the `weekdays` dataframe column name from `no` to `weekday` (use `.rename()`).

In [9]:
weekdays = weekdays.rename(columns={"no": "weekday"})

Now we can merge them on `weekday` (use `.merge()`). Name the new dataframe `merged`.

In [10]:
merged = pd.merge(df, weekdays, how='inner', on='weekday')
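Renaming is one way to line up the key columns. As an alternative sketch, `pd.merge` also accepts `left_on` and `right_on`, so the keys can keep their original names. The frames `rides` and `names` below are made-up miniatures, not the real data.

```python
import pandas as pd

# Made-up miniature versions of `df` and the un-renamed `weekdays`.
rides = pd.DataFrame({"weekday": [6, 0, 1, 6], "cnt": [985, 801, 1349, 959]})
names = pd.DataFrame({"no": [0, 1, 6], "day": ["Mon", "Tue", "Sun"]})

# Join on differently named key columns, then drop the redundant key.
toy_merged = pd.merge(
    rides, names, how="inner", left_on="weekday", right_on="no"
).drop(columns="no")
print(toy_merged)
```

Either approach gives the same columns; renaming first keeps the result free of a duplicate key column without the extra `drop`.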

Let's check if we merged the data correctly.

In [11]:
merged

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt,day
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,985,Sun
1,8,2011-01-08,1,0,1,0,6,0,2,0.165000,0.162254,0.535833,0.266804,959,Sun
2,15,2011-01-15,1,0,1,0,6,0,2,0.233333,0.248112,0.498750,0.157963,1248,Sun
3,22,2011-01-22,1,0,1,0,6,0,1,0.059130,0.079070,0.400000,0.171970,981,Sun
4,29,2011-01-29,1,0,1,0,6,0,1,0.196522,0.212126,0.651739,0.145365,1098,Sun
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,700,2012-11-30,4,1,11,0,5,1,1,0.298333,0.323867,0.649583,0.058471,5668,Sat
727,707,2012-12-07,4,1,12,0,5,1,2,0.320833,0.321958,0.764167,0.130600,5008,Sat
728,714,2012-12-14,4,1,12,0,5,1,1,0.281667,0.294192,0.642917,0.131229,5611,Sat
729,721,2012-12-21,1,1,12,0,5,1,2,0.326667,0.301767,0.556667,0.374383,3623,Sat


We would like to do the same with our `merged` dataframe and `shift` dataframe. As in the previous example, we need to rename a column first. Rename `dteday` to `date` and display the new `date` column.

In [12]:
merged = merged.rename(columns={"dteday": "date"})

In [13]:
merged['date']

0      2011-01-01
1      2011-01-08
2      2011-01-15
3      2011-01-22
4      2011-01-29
          ...    
726    2012-11-30
727    2012-12-07
728    2012-12-14
729    2012-12-21
730    2012-12-28
Name: date, Length: 731, dtype: object

As we can see, the two dataframes use different date formats: `merged` uses `YYYY-MM-DD`, while `shift` uses `DD/MM/YYYY`. To fix this, we are going to use the `datetime` library with `.strptime()` and `.strftime()`. You can find an example of how to do it [here](https://stackoverflow.com/questions/14524322/how-to-convert-a-date-string-to-different-format).

In [14]:
import datetime

In [15]:
# Convert each date from YYYY-MM-DD to DD/MM/YYYY to match the format used in `shift`.
# Assigning the whole column at once avoids chained assignment (and its SettingWithCopyWarning).
merged['date'] = merged['date'].apply(
    lambda d: datetime.datetime.strptime(d, '%Y-%m-%d').strftime('%d/%m/%Y'))
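The conversion above can also be written with pandas' own datetime tools instead of the `datetime` module. This is a minimal sketch on a made-up `dates` series; on the real data the same two calls would be applied to `merged['date']`.

```python
import pandas as pd

# A few example dates in the YYYY-MM-DD format used by the bike data.
dates = pd.Series(["2011-01-01", "2011-01-08", "2012-12-28"], name="date")

# Parse the ISO strings, then render them as day/month/four-digit year.
converted = pd.to_datetime(dates, format="%Y-%m-%d").dt.strftime("%d/%m/%Y")
print(converted.tolist())  # ['01/01/2011', '08/01/2011', '28/12/2012']
```

Using `%Y` for the year avoids having to hard-code the century in the output format.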


After changing the date format, you can merge the two dataframes together. Name the final dataframe `final_df`.

In [16]:
final_df = pd.merge(merged, shift, how='inner', on='date')

To check that you have merged the dataframes correctly, display a sample of 10 rows from `final_df`.

In [17]:
final_df.sample(10)

Unnamed: 0,instant,date,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt,day,employee
116,79,20/03/2011,1,0,3,0,0,0,1,0.3325,0.32575,0.47375,0.207721,2471,Mon,Harry
303,654,15/10/2012,4,1,10,0,1,1,2,0.561667,0.53915,0.7075,0.296037,5875,Tue,John
477,411,15/02/2012,1,1,2,0,3,1,1,0.348333,0.351629,0.53125,0.1816,4169,Thu,John
102,715,15/12/2012,4,1,12,0,6,0,1,0.324167,0.338383,0.650417,0.10635,5047,Sun,Kate
309,696,26/11/2012,4,1,11,0,1,1,1,0.313333,0.339004,0.535417,0.04665,5087,Tue,Harry
585,440,15/03/2012,1,1,3,0,4,1,1,0.5575,0.532825,0.579583,0.149883,6192,Fri,John
428,68,09/03/2011,1,0,3,0,3,1,2,0.295833,0.286608,0.775417,0.22015,1891,Thu,John
507,621,12/09/2012,3,1,9,0,3,1,1,0.599167,0.570075,0.577083,0.131846,7870,Thu,John
51,358,24/12/2011,1,0,12,0,6,0,1,0.3025,0.299242,0.5425,0.190304,1011,Sun,Kate
683,399,03/02/2012,1,1,2,0,5,1,1,0.313333,0.309346,0.526667,0.178496,4151,Sat,Kate
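A random sample is a quick visual check. A more systematic option, sketched here on made-up frames `left` and `right`, is an outer merge with `indicator=True`, which marks rows whose key appears in only one of the two dataframes.

```python
import pandas as pd

# Made-up stand-ins for `merged` and `shift`; one date is unmatched on each side.
left = pd.DataFrame({
    "date": ["01/01/2011", "02/01/2011", "03/01/2011"],
    "cnt": [985, 801, 1349],
})
right = pd.DataFrame({
    "date": ["01/01/2011", "02/01/2011", "04/01/2011"],
    "employee": ["Kate", "John", "Harry"],
})

# indicator=True adds a `_merge` column with 'both', 'left_only' or 'right_only'.
check = pd.merge(left, right, how="outer", on="date", indicator=True)

# Rows that an inner merge would silently drop.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["date", "_merge"]])
```

If `unmatched` is empty on the real data, the inner merge has not lost any rows.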


Let's say that we want to inspect the employees' performance. Display the mean `cnt` for each employee using `.groupby()`.

In [18]:
final_df[['employee','cnt']].groupby(['employee']).mean()

employee,cnt
Harry,4374.960396
John,4520.156489
Kate,4586.726592
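A mean on its own does not show how many days it is based on. As a small sketch, `.agg()` can return several statistics per group at once; the `toy` frame below reuses a handful of rows for illustration only.

```python
import pandas as pd

# Five illustrative rows: who was on shift and how many bikes were rented.
toy = pd.DataFrame({
    "employee": ["Kate", "John", "Harry", "Harry", "John"],
    "cnt": [985, 801, 1349, 1562, 1600],
})

# Mean rentals and number of shifts per employee in one table.
summary = toy.groupby("employee")["cnt"].agg(["mean", "count"])
print(summary)
```

Selecting the `cnt` column before aggregating keeps the result to the statistics of interest.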


Harry has a lower mean `cnt` than the others. This might be because they work on different days of the week. To check this, let's first see whether `cnt` differs across days of the week. Display the mean `cnt` for each day of the week.

In [19]:
# numeric_only=True skips text columns such as `date` and `employee`
final_df.groupby(['day']).mean(numeric_only=True).sort_values(by=['weekday'])

day,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
Mon,366.0,2.485714,0.504762,6.47619,0.0,0.0,0.0,1.304762,0.483628,0.465288,0.627659,0.18853,4228.828571
Tue,367.0,2.495238,0.504762,6.495238,0.142857,1.0,0.857143,1.390476,0.493449,0.474563,0.637577,0.190691,4338.12381
Wed,364.5,2.519231,0.5,6.480769,0.009615,2.0,0.990385,1.442308,0.504282,0.483337,0.641829,0.191825,4510.663462
Thu,365.5,2.5,0.5,6.509615,0.009615,3.0,0.990385,1.451923,0.504626,0.48161,0.645368,0.187736,4548.538462
Fri,366.5,2.509615,0.5,6.548077,0.019231,4.0,0.980769,1.384615,0.504342,0.4827,0.609499,0.191603,4667.259615
Sat,367.5,2.490385,0.5,6.576923,0.019231,5.0,0.980769,1.394231,0.495589,0.471112,0.613756,0.186389,4690.288462
Sun,365.0,2.47619,0.495238,6.552381,0.0,6.0,0.0,1.4,0.482038,0.462071,0.61956,0.196588,4550.542857


The mean `cnt` does differ across days of the week! To check whether this explains Harry's lower `cnt`, we can use `.pivot_table()`.

In [20]:
table = pd.pivot_table(final_df, values='cnt', index=['employee'],
                       columns=['day'], aggfunc='mean', fill_value=0)

In [21]:
table

employee,Fri,Mon,Sat,Sun,Thu,Tue,Wed
Harry,0.0,4353.680556,0.0,0.0,0.0,4130.560606,4650.9375
John,4667.259615,3899.333333,0.0,0.0,4548.538462,4808.8,3745.894737
Kate,0.0,4004.0,4690.288462,4550.542857,0.0,4563.684211,4775.095238


As we can see, Harry works only on Monday, Tuesday and Wednesday, which might be the cause of his lower `cnt`.
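In the pivot table, a mean of 0 stands for "no shifts on that day" (because of `fill_value=0`), not for zero rentals. A count table makes this explicit. The sketch below uses a made-up `toy` frame rather than `final_df`.

```python
import pandas as pd

# Made-up shift records: employee, day of the week and rentals on that day.
toy = pd.DataFrame({
    "employee": ["Kate", "John", "Harry", "Harry", "John", "Kate"],
    "day": ["Sun", "Mon", "Tue", "Wed", "Thu", "Sun"],
    "cnt": [985, 801, 1349, 1562, 1600, 959],
})

# Number of shifts per employee and day; 0 means the employee never worked that day.
shifts = pd.pivot_table(
    toy, values="cnt", index="employee", columns="day",
    aggfunc="count", fill_value=0,
)
print(shifts)
```

Reading the count table next to the mean table shows which averages rest on many days and which cells are simply empty.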

## OLS model

Now we will create a simple predictive model, which will forecast the `cnt` for a given day. To do it, we need to import `statsmodels.api`.

In [22]:
import statsmodels.api as sm

We can drop all of the unnecessary data, so that only `mnth`, `holiday`, `workingday`, `temp`, `atemp`, `hum`, `windspeed`, `day` and `cnt` are left.

In [24]:
final_df = final_df.drop(['instant','date','season','yr','weathersit','weekday','employee'],axis=1)

The `day` column is a categorical variable, so to run a regression we need to create dummy variables. To do this, use `pd.get_dummies()`.

In [26]:
# dtype=int keeps the dummies as 0/1 (recent pandas versions default to True/False)
final_df = pd.get_dummies(final_df, prefix=['d'], columns=['day'], dtype=int)

Display the final_df to check if you have created the data correctly.

In [27]:
final_df

Unnamed: 0,mnth,holiday,workingday,temp,atemp,hum,windspeed,cnt,d_Fri,d_Mon,d_Sat,d_Sun,d_Thu,d_Tue,d_Wed
0,1,0,0,0.344167,0.363625,0.805833,0.160446,985,0,0,0,1,0,0,0
1,1,0,0,0.165000,0.162254,0.535833,0.266804,959,0,0,0,1,0,0,0
2,1,0,0,0.233333,0.248112,0.498750,0.157963,1248,0,0,0,1,0,0,0
3,1,0,0,0.059130,0.079070,0.400000,0.171970,981,0,0,0,1,0,0,0
4,1,0,0,0.196522,0.212126,0.651739,0.145365,1098,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,11,0,1,0.298333,0.323867,0.649583,0.058471,5668,0,0,1,0,0,0,0
727,12,0,1,0.320833,0.321958,0.764167,0.130600,5008,0,0,1,0,0,0,0
728,12,0,1,0.281667,0.294192,0.642917,0.131229,5611,0,0,1,0,0,0,0
729,12,0,1,0.326667,0.301767,0.556667,0.374383,3623,0,0,1,0,0,0,0


Now it's time for the regression! Create two new objects, `y` and `x`. `y` contains the `cnt` column, and `x` contains the other columns of the dataframe with dummy variables, except `d_Mon`, which is left out so that it serves as the reference day.

In [28]:
x = final_df[["mnth","holiday","workingday","temp","atemp","hum","windspeed","d_Tue","d_Wed","d_Thu","d_Fri","d_Sat","d_Sun"]]
y = final_df["cnt"]
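The cell above leaves `d_Mon` out of `x` by hand, so that one day acts as the reference category. As a hedged alternative, `pd.get_dummies` can drop one category itself via `drop_first=True`. The `toy` frame is made up; note that `drop_first` removes the first category in sorted order, which on the full data would not necessarily be `Mon`.

```python
import pandas as pd

# Made-up rows with a categorical `day` column.
toy = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Mon"],
    "cnt": [801, 1349, 1562, 1229],
})

# drop_first=True omits the first category ("Mon" here) to act as the baseline.
dummies = pd.get_dummies(
    toy, prefix=["d"], columns=["day"], drop_first=True, dtype=int
)
print(dummies)
```

A row with zeros in every remaining dummy column then belongs to the omitted reference day.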

Now, we will run our model and display the model summary! (Just run the commands below).

In [29]:
model = sm.OLS(y, x).fit()
model.summary()

Dep. Variable:,cnt,R-squared (uncentered):,0.913
Model:,OLS,Adj. R-squared (uncentered):,0.911
Method:,Least Squares,F-statistic:,626.1
Date:,"Wed, 16 Feb 2022",Prob (F-statistic):,0.0
Time:,01:26:03,Log-Likelihood:,-6357.9
No. Observations:,731,AIC:,12740.0
Df Residuals:,719,BIC:,12790.0
Df Model:,12,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
mnth,114.7884,16.420,6.991,0.000,82.551,147.026
holiday,26.1834,298.169,0.088,0.930,-559.202,611.568
workingday,559.8599,121.204,4.619,0.000,321.904,797.815
temp,-3147.5535,2274.580,-1.384,0.167,-7613.165,1318.058
atemp,1.17e+04,2534.732,4.618,0.000,6728.411,1.67e+04
hum,-1369.6214,312.299,-4.386,0.000,-1982.748,-756.495
windspeed,-97.3979,593.636,-0.164,0.870,-1262.866,1068.070
d_Tue,-0.7531,135.153,-0.006,0.996,-266.094,264.588
d_Wed,39.6719,145.951,0.272,0.786,-246.868,326.212

Omnibus:,1.133,Durbin-Watson:,0.777
Prob(Omnibus):,0.568,Jarque-Bera (JB):,1.221
Skew:,-0.081,Prob(JB):,0.543
Kurtosis:,2.882,Cond. No.,4.88e+16


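The summary shows the key regression output: coefficient estimates, standard errors, t-statistics and p-values for each regressor, along with overall fit statistics. Note that R-squared is reported as uncentered because no constant term was added to `x`, and that the very large condition number (4.88e+16) signals strong collinearity among the regressors.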
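`sm.OLS` does not add an intercept on its own, which is why the summary reports an uncentered R-squared. The sketch below illustrates the difference with plain NumPy least squares on simulated data; the variables and the data-generating numbers are assumptions chosen for illustration, not estimates from the bike data.

```python
import numpy as np

# Simulated data: rentals depend on temperature plus a non-zero baseline.
rng = np.random.default_rng(0)
temp = rng.uniform(0.1, 0.9, size=200)
cnt = 1500 + 6000 * temp + rng.normal(0, 300, size=200)

# Design matrices without and with a column of ones (the intercept).
X_no_const = temp.reshape(-1, 1)
X_const = np.column_stack([np.ones_like(temp), temp])

# Ordinary least squares via NumPy.
beta_no_const, *_ = np.linalg.lstsq(X_no_const, cnt, rcond=None)
beta_const, *_ = np.linalg.lstsq(X_const, cnt, rcond=None)

# Centered R-squared: compares residuals with variation around the mean.
resid = cnt - X_const @ beta_const
r2_centered = 1 - resid @ resid / np.sum((cnt - cnt.mean()) ** 2)

# Uncentered R-squared: compares residuals with variation around zero.
resid0 = cnt - X_no_const @ beta_no_const
r2_uncentered = 1 - resid0 @ resid0 / np.sum(cnt ** 2)

print(beta_const, r2_centered)
print(beta_no_const, r2_uncentered)
```

With statsmodels, the usual way to include an intercept is to pass `sm.add_constant(x)` instead of `x`. The two R-squared figures use different baselines, so they should not be compared directly.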

To predict the value for the next day, we need to create a new dataframe to use as input. Create a new dataframe `to_predict` with the same column names as the `x` dataframe (you can use the `.columns` attribute).

In [31]:
to_predict = pd.DataFrame(columns=x.columns)

Now let's add tomorrow's data to our dataframe. The values are as follows:

    Month: 1; Holiday: 0; Workingday: 1; Temp: 0.25; Atemp: 0.2; Hum: 0.5; Windspeed: 0.15; Day: Sat (you need to represent day as a set of dummy variables)

In [32]:
# DataFrame.append() was removed in pandas 2.0, so build the one-row dataframe directly
to_predict = pd.DataFrame([{"mnth":1,"holiday":0,"workingday":1,"temp":0.25,"atemp":0.2,"hum":0.5,"windspeed":0.15,"d_Tue":0,"d_Wed":0,"d_Thu":0,"d_Fri":0,"d_Sat":1,"d_Sun":0}], columns=x.columns)

To get the prediction, we just need to call `model.predict()` and pass the dataframe with our values as the argument!

In [33]:
predicted_cnt = model.predict(to_predict)

In [34]:
predicted_cnt

0    1673.810586
dtype: float64
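For a linear model without an intercept, the prediction is simply the sum of each coefficient multiplied by the corresponding input value. The sketch below shows the arithmetic with hypothetical coefficients for a three-variable model; the numbers are made up and are not the fitted values from the summary.

```python
import pandas as pd

# Hypothetical coefficients (not the fitted values from the model above).
coef = pd.Series({"mnth": 100.0, "workingday": 500.0, "atemp": 10000.0})

# One new observation with the same variable names.
new_row = pd.Series({"mnth": 1, "workingday": 1, "atemp": 0.25})

# Linear prediction: sum over regressors of coefficient * value.
prediction = float((coef * new_row).sum())
print(prediction)  # 3100.0
```

On the fitted model, the same calculation with `model.params` and a row of `to_predict` should agree with `model.predict()`.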