# Timeseries 1 - Basic Things & Pandas (~15 minutes)

## Layout:

- (1) - Pandas basics (~5 min)
 - Just execute quickly and note keywords
- (2) - Easy Plots (~10 min)


### Prerequisites:

First, if needed, install and load some packages.

In [None]:
### if you want to run on your own computer => upgrade required package
# ! pip install matplotlib --upgrade
# ! pip install pandas --upgrade
# ! pip install seaborn --upgrade
# ! pip install plotly --upgrade
# ! pip install pystan --upgrade
# ! pip install statsmodels
# ! pip install prophet --upgrade

In [None]:
import numpy as np
import pandas as pd

# (1) A little primer on Pandas:

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is THE tool to know for any kind of in-memory analysis of tabular data. There are two main components in the pandas library:

- Series : data with an index
- Dataframes : Multiple series with one index (think spreadsheets)



## Series
The first main data type we will learn about for pandas is the Series data type.

#### A Series is very similar to a NumPy array. The difference is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It can also hold any arbitrary Python object.

You can create a Series from many Python structures.

Here is some data:

In [None]:
labels = ['a','b','c']
a_list = [1,2,3]
a_nparr = np.array([1,2,3])
a_dict = {'a':1,'b':2,'c':3}

Here we create three series with the same data (the list `[1,2,3]`)

In [None]:
pd.Series(data=a_list)
pd.Series(data=a_list,index=labels) # index is not really a column
pd.Series(a_list,labels)

You can also use numpy arrays or dicts. The cool thing about dicts is that they contain both labels and data.

In [None]:
pd.Series(a_nparr)
pd.Series(a_nparr,labels)
pd.Series(a_dict)

Series can hold anything, even python functions.

In [None]:
pd.Series([max,min,sum])

Here's a Series:

In [None]:
a_serie = pd.Series(data=a_list,index=labels)
a_serie

The values are stored in a NumPy array, accessible through the `.values` property:

In [None]:
a_serie.values

The labels are stored in an Index, accessible through the `.index` property:

In [None]:
a_serie.index

### You may have noticed: Series have indexes!

Understanding this is the KEY to pandas series. Pandas uses indexes for fast lookups - "think hashtable"

In [None]:
home_fruit_inventory = pd.Series([4,2,3,4],index = ['Apple', 'Orange','Cherry', 'Banana'])
needed_fruits = pd.Series([0,1,4,3],index = ['Apple', 'Orange','Cherry', 'Banana'])

Indexes are useful for many things, like **not** adding apples and oranges: arithmetic between two Series is aligned on labels, not on positions.

In [None]:
home_fruit_inventory - needed_fruits
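
To make the alignment visible, here is a small sketch with two Series whose labels differ in order and content. The names `home` and `needed` and their values are made up for illustration; `Series.sub` with `fill_value` is just one possible way to treat a missing label as zero.

```python
import pandas as pd

# Hypothetical inventories: labels are in a different order and only partly overlap
home = pd.Series([4, 2, 3], index=["Apple", "Orange", "Cherry"])
needed = pd.Series([1, 4, 3], index=["Orange", "Cherry", "Banana"])

# Subtraction matches values by label, not by position;
# a label present on only one side gives NaN
diff = home - needed
print(diff)

# Treat a missing label as 0 instead of producing NaN
diff_filled = home.sub(needed, fill_value=0)
print(diff_filled)
```

Whether a missing label should count as zero or stay undefined depends on the data, so the `fill_value` choice is worth making explicitly.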

## DataFrames

DataFrames are the main pandas data structure. A DataFrame can be seen as a collection of Series objects that share the same index. It's like an Excel spreadsheet within Python.

It can be created from some data, an index and column names:

In [None]:
inventories = [[4,2,3,4],[0,1,3,4],[0,0,3,1],[1,None,2,0]]

df = pd.DataFrame(inventories,index=['Apple', 'Orange','Cherry', 'Banana'],columns=["Home","Store1","Store2","Store3"])
df

## What's inside ?

Pandas offers many built-in functions to get a quick overview of a DataFrame's data.

### Dataframe information functions

In [None]:
df.describe() # Get some stats

In [None]:
df.dtypes # The data types

In [None]:
df.info() # Some more info
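
One thing these overview functions reveal is how missing values are stored. The sketch below rebuilds the same inventory table under the hypothetical name `demo` and counts missing entries per column with `isna().sum()`; the `None` in the data is expected to show up as `NaN`, which turns the affected column into a float column.

```python
import pandas as pd

# Same inventory values as above, under a hypothetical name
inventories = [[4, 2, 3, 4], [0, 1, 3, 4], [0, 0, 3, 1], [1, None, 2, 0]]
demo = pd.DataFrame(
    inventories,
    index=["Apple", "Orange", "Cherry", "Banana"],
    columns=["Home", "Store1", "Store2", "Store3"],
)

# Number of missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# The column holding the missing value is stored as floats
print(demo.dtypes)
```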

## Selecting data from a dataframe

The `df[...]` indexing usually works as expected if you're used to NumPy. But be careful, it can become ambiguous: a single label selects a column, whereas a slice selects rows.

In [None]:
df['Home']  # <-- this returns the Series object "Home"
# or df.Home

`df.loc` selects by index label and `df.iloc` selects by integer position.

In [None]:
df.loc["Orange"] # or by "position" df.iloc[1]

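Avoid the "double" selection and use `.loc[r,c]` or `.iloc[r,c]`, as in NumPy's `x[row,column]`. Chained indexing performs two separate lookups and may silently fail when used for assignment.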

In [None]:
# same as df["Home"]["Apple"] <--- This is not recommended
df.loc["Apple","Home"]
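
As a minimal sketch of the difference between label-based and position-based access, the example below uses a small made-up table called `demo` (not the notebook's `df`). It also shows an assignment through a single `.loc[r,c]` call, which is the form recommended above.

```python
import pandas as pd

# Small hypothetical table for illustration
demo = pd.DataFrame(
    [[4, 2], [0, 1], [0, 0]],
    index=["Apple", "Orange", "Cherry"],
    columns=["Home", "Store1"],
)

by_label = demo.loc["Orange", "Store1"]   # row label, column label
by_position = demo.iloc[1, 1]             # row position, column position

# A single .loc call is the safe way to write a value
demo.loc["Cherry", "Home"] = 7

# Selecting one row returns a Series indexed by the column names
apple_row = demo.loc["Apple"]
print(by_label, by_position)
print(apple_row)
```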

Boolean conditions make it easy to filter rows:

In [None]:
print(df == 0)
df[df["Home"]==0]
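
Conditions can also be combined. The sketch below, on a made-up table named `demo`, uses `&` with parentheses around each condition and `any(axis=1)` to test a condition across all columns of a row.

```python
import pandas as pd

# Hypothetical stock levels for illustration
demo = pd.DataFrame(
    {"Home": [4, 0, 0, 1], "Store1": [2, 1, 0, 0]},
    index=["Apple", "Orange", "Cherry", "Banana"],
)

# Each condition needs its own parentheses when combined with & or |
none_at_home_but_in_store = demo[(demo["Home"] == 0) & (demo["Store1"] > 0)]

# Rows where at least one column is zero
zero_somewhere = demo[(demo == 0).any(axis=1)]

print(none_at_home_but_in_store)
print(zero_somewhere)
```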

And don't forget: it's just numpy under the hood !

In [None]:
df.values # you get the underlying NumPy array (each column's Series runs along axis=0)

## Recap:

- The main pandas structure is the DataFrame, which is simply a set of Series sharing the same index.
- Series are essentially lists with an index.

[Their documentation is here](https://pandas.pydata.org/pandas-docs/stable/)



# Pandas and Timeseries

OK, let's dig into more time-series-related things.

## Data : Bike Sharing Demand data

You are provided hourly rental data spanning two years.

At first, we only consider two data fields:

- datetime - hourly date + timestamp  
- count - number of total rentals


## Loading data with read_csv:

We do two specific things while loading:

- `usecols`: We only keep the datetime and count columns
- `parse_dates`: We parse the datetime column as dates

NB: [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv) has a TON of options, be sure to check them.

In [None]:
# Let's load the data and keep only the datetime and count columns.
df = pd.read_csv("https://raw.githubusercontent.com/vguigue/TimeSeries/main/data/train.csv",parse_dates=["datetime"],usecols=['datetime','count'])

df.head()

OK, what can we do with this simple raw series?

### First things first:
Answer these simple questions:

- How many observations do we have? (10886)
- What are the minimum/maximum values? (1/977)
- Are there any missing values? (Nope)


In [None]:
print(len(df))
print(df["count"].min(),df["count"].max())
print(df["count"].isna().sum())

## Setting time as the index

For now, the series is indexed by integers (0,1,2,3,...), which can make it hard to find specific days/hours.
It would be easier if we could directly use dates to select observations.

To do so, we can set the datetime column as the dataframe index by using the `df.set_index` method:

In [None]:
time_indexed = df.set_index("datetime") # use the datetime column as the index
time_indexed.head()

In [None]:
time_indexed.reset_index().head() #reverses the "set_index"

In [None]:
time_indexed.reset_index(drop=True).head() #reverses the "set_index" but discards the index

### Select the counts of March/April 2011

**Note**: label-based range selection is inclusive on both ends, $[start:end]$, whereas positional slicing on arrays excludes the end, $[start:end[$

In [None]:
time_indexed.loc['2011-03-01':'2011-04-30']
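
The inclusive behaviour can be checked on a small synthetic series. The sketch below builds three days of hourly values (the name `s` and the data are made up) and compares a label slice with a positional slice.

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly data
idx = pd.date_range("2011-03-01", periods=72, freq=pd.Timedelta(hours=1))
s = pd.Series(np.arange(72), index=idx)

# Label slice: both bounds are included, and the end date covers its whole day
by_label = s.loc["2011-03-01":"2011-03-02"]

# Positional slice: the end position is excluded, as with arrays and lists
by_position = s.iloc[0:2]

print(len(by_label), by_label.index[-1])
print(len(by_position))
```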


### Decomposing dates

One reason why it's really useful to parse dates (besides using them as an index) is that they can easily be used for feature building.

Indeed, it's easy to understand that the bike demand might vary between days (weekdays/weekends) or seasons (summer/winter). Fortunately, all this information can be readily extracted from datetimes through one of the many [datetime-data](https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#datetime-data) attributes, such as `.minute` or `.day`.



In [None]:
df["minutes"] = df.datetime.dt.minute
df["hour"]  = df.datetime.dt.hour
df["day"]  = df.datetime.dt.day
df["month"]  = df.datetime.dt.month
df["year"]  = df.datetime.dt.year
df["weekday"]  = df.datetime.dt.day_of_week


time_indexed = df.set_index("datetime")
time_indexed.head()
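
As a small sketch of how these attributes behave, the example below uses two made-up timestamps. On a column (a Series) the attributes go through the `.dt` accessor, while on a `DatetimeIndex` they are available directly; `day_of_week` counts from Monday = 0.

```python
import pandas as pd

# Two hypothetical timestamps: a Saturday and a Monday
stamps = pd.to_datetime(["2011-01-01 08:00", "2011-01-03 17:30"])

# As a column: use the .dt accessor
col = pd.Series(stamps)
hours_from_column = col.dt.hour.tolist()
weekdays_from_column = col.dt.day_of_week.tolist()

# As an index: attributes are available directly, without .dt
idx = pd.DatetimeIndex(stamps)
hours_from_index = list(idx.hour)
day_names = list(idx.day_name())

print(hours_from_column, weekdays_from_column)
print(hours_from_index, day_names)
```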


## (2) Easy Plotting

The best way to visualize time series is to plot them. To make plots in Python, there are a LOT of existing options; here we'll concentrate on two:

- Matplotlib
- Seaborn

## Matplotlib : The classic one

Matplotlib is integrated in pandas, and
[pandas can automagically plot things using matplotlib](https://pandas.pydata.org/pandas-docs/version/0.23.4/api.html#api-dataframe-plotting). Let's quickly compare the two ways of using matplotlib.


### Let's say we want to visualize the bike count on the fifth day:

#### 1 - The RAW way : calling `plt.plot`


In [None]:
%matplotlib inline
#Makes sure you get an image in notebook
import matplotlib.pyplot as plt

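day_number = 5
hours_per_day = 24
# Positional offset: only matches the calendar day if no hourly row is missing
day_offset = (day_number-1)*hours_per_day
plt.plot(time_indexed["count"].values[day_offset:day_offset+hours_per_day])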

# In truth,
# plt.plot(time_indexed.loc["20110105","count"].values) would have worked just fine.

plt.show()                      # Shows plot
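
Positional offsets only line up with calendar days when every hour is present. The sketch below uses a synthetic hourly series with one hour removed (all names and data are made up) to show how the positional window drifts while the label-based selection stays on the requested day.

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly timestamps, with 05:00 on the first day removed
full = pd.date_range("2011-01-01", periods=72, freq=pd.Timedelta(hours=1))
idx = full.delete(5)
s = pd.Series(np.arange(len(idx)), index=idx)

day_number = 2
offset = (day_number - 1) * 24

# Positional window: shifted by one hour because of the missing row
positional = s.iloc[offset:offset + 24]

# Label-based selection: exactly the rows dated 2011-01-02
by_label = s.loc["2011-01-02"]

print(positional.index[0], positional.index[-1])
print(by_label.index[0], by_label.index[-1])
```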

#### 2 -  The pandas way
With pandas it's much easier
(and you get free $x$ labels):

In [None]:
time_indexed.loc["20110105","count"].plot() # benefits from indexed time
plt.show()

## Plot Types

There are multiple plot types built in:

<pre>
df.plot.hist()     histogram
df.plot.bar()      bar chart
df.plot.barh()     horizontal bar chart
df.plot.line()     line chart
df.plot.area()     area chart
df.plot.scatter()  scatter plot
...
</pre>

NOTE: You can also call specific plots by passing their name as an argument, as with `df.plot(kind='area')`.

## (TODO) What if we want to plot a bunch of days on the same $x$ axis?

- Plot days 2, 4, 6, 8 and 12 on the same $x$ axis, whose index goes from 0 to 23.

In [None]:
for day in [2,4,6,8,12]:
    day_number = day
    # to complete

plt.show()

### (TODO) The following code does not plot the days on the same $x$ axis, fix it!

In [None]:
# This doesn't work well -> Why ?
time_indexed.loc["20110102","count"].plot() # To FIX !!!!
time_indexed.loc["20110104","count"].plot() # To FIX !!!!
time_indexed.loc["20110106","count"].plot() # To FIX !!!!
time_indexed.loc["20110108","count"].plot() # To FIX !!!!
time_indexed.loc["20110112","count"].plot() # To FIX !!!!
plt.show()

## Seaborn

[Seaborn](https://seaborn.pydata.org/index.html) is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Here is some of the functionality that seaborn offers:

- A dataset-oriented API for examining relationships between multiple variables
- Specialized support for using categorical variables to show observations or aggregate statistics
- Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data
- Automatic estimation and plotting of linear regression models for different kinds of dependent variables
- Convenient views onto the overall structure of complex datasets
- High-level abstractions for structuring multi-plot grids that let you easily build complex visualizations
- Concise control over matplotlib figure styling with several built-in themes
- Tools for choosing color palettes that faithfully reveal patterns in your data

Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.

#### What's interesting with seaborn is that it's tightly integrated with Pandas:

Recall our `time_indexed` dataframe

In [None]:
time_indexed.head()

Let's say we want to see how the bike rental count evolves through a day.
We can simply ask for a [line plot](https://seaborn.pydata.org/generated/seaborn.lineplot.html#seaborn.lineplot) of the count through the hours. Seaborn does all the heavy lifting: since there are many observations for each hour, by default it draws their mean with a confidence band around it.

In [None]:
import seaborn as sns

sns.lineplot(data=time_indexed, x="hour",y="count")
# sns.lineplot(data=time_indexed, x="weekday",y="count") # try with day
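
To see what the plotted line represents, the same aggregation can be sketched by hand with `groupby`. The table `demo` below is made up; with seaborn's default estimator, the line should correspond to these per-hour means.

```python
import pandas as pd

# Hypothetical observations: several counts per hour
demo = pd.DataFrame({
    "hour": [8, 8, 17, 17, 17],
    "count": [10, 20, 30, 60, 90],
})

# Mean count for each hour, i.e. one point of the line per hour
hourly_mean = demo.groupby("hour")["count"].mean()
print(hourly_mean)
```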

Does it change through the years? We can simply add a `hue` on the year variable:

In [None]:
sns.lineplot(data=time_indexed, x="hour",y="count",hue="year")

#### Is there a difference between weekdays and weekend days? What could we plot to see this?

In [None]:
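# weekday numbering starts at Monday=0, so 5 is Saturday and 6 is Sunday
sns.lineplot(data=time_indexed[time_indexed["weekday"] == 5], x="hour",y="count",label="Saturday")
sns.lineplot(data=time_indexed[time_indexed["weekday"] == 6], x="hour",y="count",label="Sunday")
sns.lineplot(data=time_indexed[time_indexed["weekday"] < 5], x="hour",y="count",label="Monday to Friday")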
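
Another option, sketched here on made-up data, is to derive a label column and let it drive the grouping. The column name `day_type` is an assumption for illustration; such a column could then be passed to seaborn as `hue`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: weekday uses Monday=0 ... Sunday=6
demo = pd.DataFrame({
    "weekday": [0, 5, 6, 2],
    "hour": [8, 8, 8, 8],
    "count": [100, 20, 30, 120],
})

# Label each row as weekend (Saturday/Sunday) or weekday
demo["day_type"] = np.where(demo["weekday"] >= 5, "weekend", "weekday")

# Average count per day type
means = demo.groupby("day_type")["count"].mean()
print(means)

# With seaborn, the same column could be used as:
# sns.lineplot(data=demo, x="hour", y="count", hue="day_type")
```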

### This was just a glimpse of seaborn

Be sure to have a look [at their documentation](https://seaborn.pydata.org/tutorial.html)

## What if we want to see the bigger picture ?

In [None]:
time_indexed.loc["2011-06"] # row selection by partial date string goes through .loc

Let's plot the first 19 days of a month (June 2011):

In [None]:
month_data = time_indexed.loc["2011-06-01":"2011-06-19","count"]

month_data.plot(figsize=(25,12))
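
For an even coarser view, one possible approach (not used in this notebook) is to aggregate the hourly values per day with `resample`. The sketch below uses a synthetic hourly series named `hourly`; on the real data, the same call could be applied to `month_data`.

```python
import pandas as pd

# Two days of synthetic hourly counts, all equal to 1
idx = pd.date_range("2011-06-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.Series(1, index=idx)

# Daily totals: one value per calendar day
daily = hourly.resample("D").sum()
print(daily)
```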