# Python for Data Science Project Session 4: Economics and Finance

This dataset contains the hourly and daily counts of rental bikes in the Capital bike-share system for the years 2011 and 2012, together with the corresponding weather and seasonal information. You can find more information about the dataset [here](https://archive-beta.ics.uci.edu/ml/datasets/bike+sharing+dataset). This notebook covers data transformations, pivot tables and simple regression.

## Analysing the dataset

First, let's import Pandas and NumPy.

Now, we need to load the data (use `pandas.read_csv()`; the dataset name is `day.csv`; save it as `df`).
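A minimal sketch of the loading step. Since `day.csv` is not available here, the example reads from an in-memory string instead; the rows and the reduced set of columns are illustrative stand-ins, not the real file.

```python
import io
import pandas as pd

# Illustrative stand-in for day.csv: a few rows and only some of the columns.
csv_text = """dteday,weekday,temp,casual,registered,cnt
2011-01-01,6,0.34,331,654,985
2011-01-02,0,0.36,131,670,801
2011-01-03,1,0.20,120,1229,1349
"""

# In the notebook this would be: df = pd.read_csv("day.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```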

Display the dataframe and use `.describe()` to check whether your dataset has any missing values (the `count` row shows the number of non-missing values in each numeric column).

We can see that our dataset has no missing values. Now, let's drop columns that we won't use (`casual`, `registered`).
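As a complement to `.describe()`, `.isna().sum()` counts missing values in every column directly, including non-numeric ones. The sketch below uses a small made-up dataframe in place of `df` and then drops the two unused columns.

```python
import pandas as pd

# Made-up stand-in for df.
df = pd.DataFrame({
    "weekday": [6, 0, 1],
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt": [985, 801, 1349],
})

# Number of missing values per column; all zeros means nothing is missing.
missing = df.isna().sum()
print(missing)

# Drop the columns that will not be used.
df = df.drop(columns=["casual", "registered"])
print(df.columns.tolist())
```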

As we can see, our weekdays are displayed as numbers. We want to make this more intuitive by showing the name of the day that corresponds to each number. To do this, create a dataframe that contains the number (`no`; 0, 1, 2, ...) and the corresponding day (`day`; "Mon", "Tue", "Wed", ...). Call it `weekdays`.
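One possible way to build the lookup table, following the numbering given above (0 for "Mon", 1 for "Tue", and so on):

```python
import pandas as pd

# Lookup table: weekday number -> short day name, as described in the text.
weekdays = pd.DataFrame({
    "no": range(7),
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
})
print(weekdays)
```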

The last piece of data that we will use is the shift data. `shift.csv` contains the date and the name of the employee who was on shift that day (let's say they work at the helpdesk). We need to load it (call the dataframe `shift`, with the first column named `date` and the second `employee`) and display it.

We have all the data that we need!

We would like to combine `weekdays` with `df` on the number of the day. As we can see, in `df` the number of the day is called `weekday`, and in `weekdays` it is called `no`. Therefore, we need to rename one of the columns. Let's rename the `no` column of the `weekdays` dataframe to `weekday` (use `.rename()`).

Now we can merge them on `weekday` (use `.merge()`). Name the new dataframe `merged`.
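The rename and merge steps can be sketched as follows, using small made-up dataframes in place of `df` and `weekdays`. The `how="left"` argument is a choice made for this example so that every row of `df` is kept in its original order; the default is an inner join.

```python
import pandas as pd

# Made-up stand-ins for df and weekdays.
df = pd.DataFrame({"weekday": [0, 1, 2, 0], "cnt": [985, 801, 1349, 1562]})
weekdays = pd.DataFrame({"no": [0, 1, 2], "day": ["Mon", "Tue", "Wed"]})

# Give the key column the same name in both dataframes.
weekdays = weekdays.rename(columns={"no": "weekday"})

# Attach the day name to every row of df.
merged = df.merge(weekdays, on="weekday", how="left")
print(merged)
```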

Let's check if we merged the data correctly.

We would like to do the same with our `merged` dataframe and the `shift` dataframe. As in the previous example, we need to rename some columns. Rename `dteday` to `date` and display the new `date` column.

As we can see, the two dataframes use different date formats. To fix this, we are going to use the `datetime` library with `.strptime()` and `.strftime()`. You can find an example of how to do it [here](https://stackoverflow.com/questions/14524322/how-to-convert-a-date-string-to-different-format).
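A minimal sketch of the conversion. The source format `YYYY-MM-DD` and the target format `DD/MM/YYYY` are assumptions made for illustration; use whatever formats your two `date` columns actually have. `reformat_date` is a hypothetical helper name.

```python
from datetime import datetime
import pandas as pd

# Made-up stand-in for the date column of merged.
merged = pd.DataFrame({"date": ["2011-01-01", "2011-01-02"]})

def reformat_date(text):
    # Parse "YYYY-MM-DD", then write it back as "DD/MM/YYYY" (assumed formats).
    return datetime.strptime(text, "%Y-%m-%d").strftime("%d/%m/%Y")

merged["date"] = merged["date"].apply(reformat_date)
print(merged["date"].tolist())
```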

After changing the date format, you can merge the two dataframes together. Name the final dataframe `final_df`.

To check whether you have merged the dataframes correctly, display a sample of 10 rows from `final_df`.

Let's say that we want to inspect the employees' performance. Display the mean `cnt` for each employee using `.groupby()`.
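The grouping step might look like the sketch below. The data is made up, and "Anna" is a hypothetical employee name used only for this example.

```python
import pandas as pd

# Made-up stand-in for final_df.
final_df = pd.DataFrame({
    "employee": ["Harry", "Anna", "Harry", "Anna"],
    "cnt": [1000, 3000, 2000, 5000],
})

# Mean number of rentals on the days each employee was on shift.
mean_cnt = final_df.groupby("employee")["cnt"].mean()
print(mean_cnt)
```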

Harry has a lower mean `cnt` than the others. This might be because the employees work on different days of the week. To check this, let's first see whether `cnt` differs across the days of the week. Display the mean `cnt` for each day of the week.

The differences in mean `cnt` across the days of the week do exist! To check whether this causes Harry to have a lower `cnt`, we can use `.pivot_table()`.

As we can see, Harry works only on Monday, Tuesday and Wednesday, which might be the cause of his lower `cnt`.
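A pivot table of this kind can be sketched as follows, with one row per employee and one column per day. The data is made up and "Anna" is a hypothetical name. A `NaN` cell means the employee never worked on that day.

```python
import pandas as pd

# Made-up stand-in for final_df.
final_df = pd.DataFrame({
    "employee": ["Harry", "Harry", "Anna", "Anna", "Anna"],
    "day": ["Mon", "Tue", "Mon", "Sat", "Sat"],
    "cnt": [1000, 2000, 3000, 5000, 7000],
})

# Mean cnt for every (employee, day) combination.
table = final_df.pivot_table(values="cnt", index="employee", columns="day", aggfunc="mean")
print(table)
```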

## OLS model

Now we will create a simple predictive model, which will forecast the `cnt` for a given day. To do it, we need to import `statsmodels.api`.

We can drop all of the unnecessary data, so that only `mnth`, `holiday`, `workingday`, `temp`, `atemp`, `hum`, `windspeed`, `day` and `cnt` are left.

The `day` column is a categorical variable, so to run a regression we need to create dummy variables. To do this, use the `pandas.get_dummies()` function.
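A minimal sketch of the dummy encoding on made-up data. `dtype=int` is a choice made here so that the dummies appear as 0/1 rather than booleans.

```python
import pandas as pd

# Made-up stand-in for final_df.
final_df = pd.DataFrame({
    "temp": [0.3, 0.4, 0.2],
    "day": ["Mon", "Sat", "Mon"],
    "cnt": [985, 801, 1349],
})

# Replace the day column with one 0/1 column per distinct day.
final_df = pd.get_dummies(final_df, columns=["day"], dtype=int)
print(final_df)
```

If the regression also includes an intercept, a full set of day dummies is perfectly collinear with it; passing `drop_first=True` to `get_dummies()` is one common way to avoid that.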

Display `final_df` to check that you have created the dummy variables correctly.

Now it's time for the regression! Create two new dataframes `y` and `x`. `y` is the dataframe that contains the `cnt` column, and `x` contains all the other columns (of the dataframe with dummy variables).
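The split into target and features can be sketched like this, again on a made-up dataframe that already contains dummy columns:

```python
import pandas as pd

# Made-up stand-in for the dataframe with dummy variables.
final_df = pd.DataFrame({
    "temp": [0.3, 0.4, 0.2],
    "day_Mon": [1, 0, 1],
    "day_Sat": [0, 1, 0],
    "cnt": [985, 801, 1349],
})

# y holds only the target column; x holds everything else.
y = final_df[["cnt"]]
x = final_df.drop(columns=["cnt"])
print(x.columns.tolist())
```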

Now, we will run our model and display the model summary! (Just run the commands below).

We can see all the important regression information which we can analyse!

To predict the value for the next day, we need to create a new dataframe that we will use as an input. Create a new dataframe `to_predict` with the same column names as the `x` dataframe (you can use the `.columns` attribute).

Now let's add tomorrow's data to our dataframe. The values are as follows:

    Month: 1; Holiday: 0; Workingday: 1; Temp: 0.25; Atemp: 0.2; Hum: 0.5; Windspeed: 0.15; Day: Sat (you need to represent day as a set of dummy variables)
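One way to build the input row is sketched below. The column names in `x_columns` are a hypothetical layout for `x` (in the notebook you would use `x.columns`), and the dummy names assume the `day_` prefix that `get_dummies()` produces for a column called `day`.

```python
import pandas as pd

# Hypothetical column layout of x: the numeric columns plus one dummy per day.
x_columns = ["mnth", "holiday", "workingday", "temp", "atemp", "hum", "windspeed",
             "day_Mon", "day_Tue", "day_Wed", "day_Thu", "day_Fri", "day_Sat", "day_Sun"]

# Tomorrow's values from the text; Saturday is encoded as day_Sat = 1.
tomorrow = {"mnth": 1, "holiday": 0, "workingday": 1, "temp": 0.25, "atemp": 0.2,
            "hum": 0.5, "windspeed": 0.15, "day_Sat": 1}

# reindex orders the columns like x and fills the remaining dummies with 0.
to_predict = pd.DataFrame([tomorrow]).reindex(columns=x_columns, fill_value=0)
print(to_predict)
```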

To get the prediction, we just need to call `model.predict()` and pass the dataframe with our values as the argument!