<div hidden=True>
    author: Marco Angius
    company: TomorrowData srl
    mail: marco.anguis@tomorrowdata.io
    notebook-version: oct19
    
</div>

# Hands-on 1: Python and Pandas

This section introduces some basic concepts of the Python programming language that are used in the next sections and in the following hands-on sessions. After that, an important Python library, Pandas, is analyzed in order to learn the fundamental building blocks for the next sessions.

# Python

From the official [site](https://www.python.org/):

>**Quick & Easy to Learn**
Experienced programmers in any other language can pick up Python very quickly, and beginners find the clean syntax and indentation structure easy to learn. Whet your [appetite with our Python 3 overview](https://docs.python.org/3/tutorial/)
>
>**Compound Data Types**
Lists (known as arrays in other languages) are one of the compound data types that Python understands. Lists can be indexed, sliced and manipulated with other built-in functions. [More about lists in Python 3](https://docs.python.org/3/tutorial/introduction.html#lists)
>
>**Intuitive Interpretation**
Calculations are simple with Python, and expression syntax is straightforward: the operators +, -, * and / work as expected; parentheses `()` can be used for grouping. [More about simple math functions in Python 3](http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator)
>
>**All the Flow You’d Expect**
Python knows the usual control flow statements that other languages speak `if`, `for`, `while` and `range` with some of its own twists, of course. [More control flow tools in Python 3](https://docs.python.org/3/tutorial/controlflow.html)
>
>**Functions Defined**
The core of extensible programming is defining functions. Python allows mandatory and optional arguments, keyword arguments, and even arbitrary argument lists. [More about defining functions in Python 3](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)
>
>```python 
def fib(n):
    a, b = 0, 1
    while a < n:
        print(a, end=' ')
        a, b = b, a+b
    print("\nDone!")
>```

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Python Types**: 

- `None`: The Python “null” value (only one instance of the None object exists)
- `str`: String type; holds Unicode (UTF-8 encoded) strings
- `bytes`: Raw ASCII bytes (or Unicode encoded as bytes)
- `float`: Double-precision (64-bit) floating-point number (note there is no separate double type)
- `bool`: A True or False value
- `int`: Arbitrary precision signed integer

*NOTE: beyond these built-in types, Python follows a consistent object model: every string, number, function, etc. lives inside a box called a `Python object`.*

</div>

In [19]:
# dynamic reference and strong types 

In [20]:
# imports and functions

In [21]:
# binary operators and comparisons

In [23]:
# mutable and immutable objects

In [24]:
# type casting

In [25]:
# control flow 
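
The topics listed in the empty cells above can be illustrated with a minimal sketch. The variable names are purely illustrative and are not part of the original notebook.

```python
# A name is just a reference: it can be rebound to an object of another type.
value = 5
first_type = type(value).__name__
value = "five"
second_type = type(value).__name__

# Strong typing: Python does not silently convert between unrelated types.
try:
    result = "5" + 5
except TypeError:
    result = "TypeError"

# An explicit cast is required instead.
total = int("5") + 5

# Lists are mutable and shared between references.
a = [1, 2, 3]
b = a          # b references the same list object as a
b.append(4)    # the change is visible through both names

# Tuples are immutable: item assignment raises a TypeError.
point = (1, 2)
tuple_is_immutable = False
try:
    point[0] = 10
except TypeError:
    tuple_is_immutable = True

# Control flow: if / for with range.
evens = []
for n in range(6):
    if n % 2 == 0:
        evens.append(n)

print(first_type, second_type, result, total, a, tuple_is_immutable, evens)
```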

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Python Built-In Data Structures**: 

- `tuple`: fixed length and immutable sequence of Python objects
- `list`: variable length and mutable sequence of Python objects
- `dict`: a flexible-size collection of key-value pairs, also called a hash map. Keys and values are Python objects (keys must be hashable).
- `set`: unordered collection of unique elements. Like a dict with keys only and no values. 

</div>

In [None]:
# tuple

In [None]:
# list

In [None]:
# dict

In [None]:
# set
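
As a rough illustration of the four built-in data structures, the sketch below uses made-up values; none of the names come from the original notebook.

```python
# tuple: fixed length and immutable; supports unpacking
coordinates = (45.07, 7.69)
latitude, longitude = coordinates

# list: mutable sequence with indexing and slicing
readings = [21.5, 22.0, 23.1, 22.7]
readings.append(21.9)
last_two = readings[-2:]

# dict: key-value pairs; keys must be hashable
sensor = {"id": "s1", "unit": "C"}
sensor["location"] = "lab"
unit = sensor.get("unit")
owner = sensor.get("owner", "unknown")  # default when the key is missing

# set: duplicates are discarded; supports set algebra
seen = {1, 2, 3, 3}
common = seen & {2, 3, 4}

print(latitude, last_two, unit, owner, seen, common)
```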

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Python Functions**: 

Functions are one of the most used methods for code organization and reuse.

- Functions may have *positional* and *keyword* arguments. The latter are typically used for optional values with defaults and can be passed in any order.

- Functions can also return one or more values.

- Functions Are Objects: they can be assigned to a variable or included in one of the previously seen data structures such as a list.

- Anonymous (lambda) functions: simple functions consisting of a single expression, defined with the `lambda` keyword.


</div>

In [None]:
# function example

In [None]:
# returned values (simple or other functions) 

In [None]:
# functions as an object

In [None]:
# anonymous functions
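
The points about functions can be sketched as follows. The helper names `scale` and `min_max` are hypothetical and chosen only for this example.

```python
def scale(values, factor=2, offset=0):
    """Return each value multiplied by factor and shifted by offset."""
    return [v * factor + offset for v in values]


def min_max(values):
    """Return two values at once, packed into a tuple."""
    return min(values), max(values)


data = [3, 1, 2]

# Positional argument only: the keyword arguments keep their defaults.
doubled = scale(data)

# Keyword arguments can be given in any order.
shifted = scale(data, offset=1, factor=3)

# Multiple return values are unpacked from the returned tuple.
lowest, highest = min_max(data)

# Functions are objects: they can be stored in a list and called later.
operations = [min, max, sum]
results = [op(data) for op in operations]

# Anonymous function used as a sort key.
by_length = sorted(["pandas", "np", "python"], key=lambda word: len(word))

print(doubled, shifted, lowest, highest, results, by_length)
```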

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Python Classes**: 

Python also supports the object-oriented programming paradigm. Because everything in Python is an object, it is important to understand the basic concepts behind the `class` definition.  

- `Attribute Reference`: define and access the attributes (or fields) of a class.
- `Instantiation`: create an instance of a class by calling it with function notation.
- `Method Objects`: functions defined in a class that operate on the attributes of an instance.

</div>

In [None]:
# define a simple class

In [None]:
# create instances of the above defined class

In [None]:
# show instance and class variables
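
A minimal sketch of a class with a class variable, instance attributes and methods. The `Sensor` class is a made-up example, not something defined elsewhere in this notebook.

```python
class Sensor:
    # Class variable: shared by all instances.
    unit = "C"

    def __init__(self, name):
        # Instance attributes: each instance has its own copy.
        self.name = name
        self.readings = []

    def add_reading(self, value):
        """Store a new reading on this instance."""
        self.readings.append(value)

    def mean(self):
        """Return the average of the stored readings."""
        return sum(self.readings) / len(self.readings)


# Instantiation uses function-call notation.
kitchen = Sensor("kitchen")
garage = Sensor("garage")

kitchen.add_reading(20.0)
kitchen.add_reading(22.0)

print(kitchen.name, kitchen.mean(), kitchen.unit)
print(garage.name, garage.readings, Sensor.unit)
```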

<hr>

# Pandas

From the official [site](https://pandas.pydata.org/): 
> Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
> Pandas is well suited for many different kinds of data:
>- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
>- Ordered and unordered (not necessarily fixed-frequency) time series data.
>- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
>- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Pandas allows for loading data from different formats (csv, parquet, json, excel...).
The two main data structures used in pandas are: 
- `pandas.DataFrame`: 2D labeled, size-mutable structure with heterogeneously-typed columns (tabular data)
- `pandas.Series`: 1D labeled homogeneously-typed array (suited for time series)

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 1: Series**

A Series can be thought of as a one-dimensional array of values of the same type, each associated with a label. The labels form the index of the Series. 

- Define a **Series** object by passing a list of values `pd.Series(list(range(10)))` and store it in a variable named *my_series*.
- Update the index of *my_series* by assigning dates to the `my_series.index` attribute.  
- Check whether a label is present in the index of *my_series* using the syntax `<label> in my_series`. Note that `in` tests the index labels, not the values; to test the values use `<any-value> in my_series.values`.
- Access one or more values by means of square-bracket subscripting, `my_series[<index-value>]`. It is also possible to pass a list of index values.
- Apply a NumPy function element-wise to the series, for example `np.exp(my_series)`.  
- Define another **Series** by passing a dictionary of key-value pairs, where the keys become the index.

</div>

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Pandas Tip 1**: it is possible to create a range of datetime values with the pandas function `pd.date_range()` ([API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html)).

The function takes several parameters, the most important: 
- `start`: start date (example "2020-10-10")
- `end`: (optional) ending date (example "2020-10-19")
- `freq`: sampling frequency within the defined boundaries. Use "D" for days, "H" for hours (spelled "h" from pandas 2.2 onward), etc.
- `periods`: number of periods to generate (allows for more control)
</div>
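
The sketch below combines `pd.date_range()` with the basic Series operations described above. The dates are arbitrary examples and `.loc` is used for label-based access; treat it as one possible approach rather than the reference solution.

```python
import numpy as np
import pandas as pd

# Ten daily timestamps starting from an arbitrary example date.
dates = pd.date_range(start="2020-10-10", periods=10, freq="D")

# Series of the integers 0..9, re-labelled with the dates.
my_series = pd.Series(list(range(10)))
my_series.index = dates

# `in` tests the index labels; use `.values` to test the values.
has_label = pd.Timestamp("2020-10-12") in my_series
has_value = 3 in my_series.values

# Label-based access with a single label and with a list of labels.
single = my_series.loc[pd.Timestamp("2020-10-12")]
several = my_series.loc[[dates[0], dates[-1]]]

# NumPy functions are applied element-wise and keep the index.
exp_series = np.exp(my_series)

# A dictionary becomes a Series whose index is made of the keys.
from_dict = pd.Series({"a": 1.0, "b": 2.0})

print(has_label, has_value, single)
print(several)
```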

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 2: DataFrames**

A DataFrame is a table and can be thought of as a collection of Series that all share the same index. The main concepts of a DataFrame are *rows* and *columns*, which are both indexed. 

- Define a `DataFrame` object by passing a dictionary of key-value pairs, where each value is a list or an array of the same size (a dictionary is already provided). Assign the new instance to a variable called *my_frame*.

- Set the index of the new DataFrame to the provided *timestamp* array.

- Check for missing values. Use the method `my_frame.isna()`. 

- Fill missing values with the `my_frame.fillna()` method. It is also possible to drop missing values with `my_frame.dropna()`.

- Define a new index with the `pd.date_range()` function, with a range from *'2020-10-01'* to *'2020-11-02'* and daily frequency.

- Update the index of *my_frame* using the `my_frame.reindex()` method. 

- Are there missing values after reindexing? Check the official [reindex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html) API for possible solutions. 

</div>

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Pandas Tip 2**: it is possible to specify the index while instantiating either a **Series** or a **DataFrame** with the `index` parameter. For a **DataFrame** it is also possible to specify column names with the `columns` parameter. 

</div>

In [165]:
samples = 30
frame_dict = {
    'Temperature': [x if int(x) != 23 else None for x in np.random.normal(loc=22, scale=3, size=samples)],
    'Humidity': [x if int(x) != 57 else None for x in np.random.normal(loc=55, scale=5, size=samples)]
}
timestamp = pd.date_range(start='2020-10-01', periods=samples, freq='D')
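
A small, deterministic sketch of the DataFrame operations from Exercise 2. It uses four made-up rows instead of the random `frame_dict` above so that the results are reproducible; filling with the column mean is just one possible choice.

```python
import pandas as pd

# Made-up data with one missing value per column.
data = {
    "Temperature": [21.0, None, 23.5, 22.1],
    "Humidity": [55.0, 57.2, None, 54.3],
}
index = pd.date_range(start="2020-10-01", periods=4, freq="D")
frame = pd.DataFrame(data, index=index)

# Count the missing values in each column.
missing_per_column = frame.isna().sum()

# Either fill the gaps (here with the column mean) or drop incomplete rows.
filled = frame.fillna(frame.mean())
dropped = frame.dropna()

# Reindexing on a longer range introduces rows with no data (NaN).
longer_index = pd.date_range(start="2020-10-01", end="2020-10-06", freq="D")
reindexed = frame.reindex(longer_index)

# method="ffill" fills only the newly created rows from the last known row.
forward_filled = frame.reindex(longer_index, method="ffill")

print(missing_per_column)
print(reindexed)
print(forward_filled)
```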

<hr>

## Load and Manipulate DataFrames with pandas

The examples provided in this notebook are based on data from the BTS (*Bureau of Transportation Statistics*) concerning **Air Carrier Statistics** of U.S. carriers, retrieved [here](https://www.transtats.bts.gov/tables.asp?Table_ID=258&SYS_Table_Name=T_T100D_MARKET_US_CARRIER_ONLY). 

In detail the dataset is based on the *T-100 Domestic Market*: 
> This table contains domestic market data reported by U.S. air carriers, including carrier, origin, destination, and service class for enplaned passengers, freight and mail when both origin and destination airports are located within the boundaries of the United States and its territories.

## T-100 Domestic Market Dataset


Details of selectable fields for the downloaded table can be found [here](https://www.transtats.bts.gov/DL_SelectFields.asp). 

The pre-downloaded data has the following columns: 
1. UniqueCarrier
2. UniqueCarrierName
3. CarrierRegion
4. OriginAirportID
5. Origin
6. OriginCityName
7. DestAirportID
8. Dest
9. DestCityName
10. Month
11. Passengers
12. Freight
13. Mail
14. Distance

Only 2019 data has been downloaded in zip format. 

<hr>

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 3**
- load the *t100_domestic_market* dataset using `pd.read_csv()`.
- display information about the dataframe using `df.info()`.
- show the first 5 rows of the dataframe using `df.head()`.

Do columns contain null values?

</div>

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Jupyter Notebooks Tip 1**: you can press `<tab>` to autocomplete or list the possible methods for an object.

</div>

In [2]:
AIRLINE_DATA = "./t100_domestic_market_bts.zip"

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Pandas Tip 3**: `df.describe()` shows summary statistics for the numerical columns of the current dataframe. 
It provides: 
<ul>
    <li>count: number of non-null samples in the dataframe for the given column</li>
    <li>mean: mean value for the given column</li>
    <li>std: standard deviation for the given column</li>
    <li>min/max: min/max values for the given column</li>
    <li>25%, 50%, 75%: percentiles for the given column</li>
</ul>

Percentiles are useful to check under which value a specified subset (percentage) of the observed data falls.

</div>
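
The loading and inspection steps can be tried on a tiny in-memory CSV. The sketch below assumes three made-up rows and reads them through `io.StringIO` rather than from the real zip file, so it only illustrates the calls.

```python
import io

import pandas as pd

# Tiny made-up CSV; the second row has no PASSENGERS value.
csv_text = """UNIQUE_CARRIER,PASSENGERS,DISTANCE
AA,100,500
AA,,700
DL,50,300
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

# Column types and non-null counts are printed by info().
df_demo.info()

# First rows of the table.
first_rows = df_demo.head(2)

# Number of null values per column.
nulls = df_demo.isna().sum()

# Summary statistics: only the numerical columns are included.
stats = df_demo.describe()

print(first_rows)
print(nulls)
print(stats)
```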

<hr>

<div class="alert alert-warning" role="alert">
    
<img src="./icons/new.png"  width="20" height="20" align="left"> &nbsp;  **NumPy** 

NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object called `ndarray`
- mathematical functions for fast operations
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

</div>


<div align="center">Example of slicing in pandas (the same applies to NumPy arrays).</div>
<img src="./images/nparray_slicing.png"  width="900" height="600" align="center">

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 4**

As you have probably observed there is a strange column named `Unnamed: 11`. We can drop it! 
- drop the column using `df.drop()`. Use the *inplace* parameter if you want!

<br>

The general syntax for slicing a collection in python is `[start:stop:step]`. If start or stop is omitted the default value is the first or last element respectively.  


- show the first five rows to see if the column is still present. Use slicing `df[:5]`.
- what if we want to show the last 5 rows? 
- what about the last 5 rows in descending order?

<br>

Rows can be located by their integer position (use `df.iloc[]`) or, once an index is set, by their index value (use `df.loc[]`). 

- set an index, for example the *UNIQUE_CARRIER* column. Use `df.set_index()`.
- now try to find all rows related to the carrier id *27Q*. 

</div>
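
A sketch of dropping a column, slicing rows and locating rows by position or label. The small dataframe and its `Unnamed: 2` column are made up to mimic the situation described in Exercise 4.

```python
import pandas as pd

# Made-up frame with an empty, unwanted column.
df_demo = pd.DataFrame({
    "UNIQUE_CARRIER": ["27Q", "AA", "27Q", "DL", "AA", "UA"],
    "PASSENGERS": [10, 200, 15, 120, 180, 90],
    "Unnamed: 2": [None] * 6,
})

# drop() returns a new frame unless inplace=True is passed.
df_demo = df_demo.drop(columns=["Unnamed: 2"])

# Row slicing follows the [start:stop:step] convention.
first_two = df_demo[:2]
last_two = df_demo[-2:]
last_two_reversed = df_demo[-2:][::-1]

# After setting an index, rows can be found by position or by label.
indexed = df_demo.set_index("UNIQUE_CARRIER")
third_row = indexed.iloc[2]        # integer position
carrier_rows = indexed.loc["27Q"]  # index label (may match several rows)

print(last_two_reversed)
print(carrier_rows)
```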

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Jupyter Notebooks Tip 2**: you can press `<left-shift> + <tab>` while the cursor is between the parentheses of a function call to show its documentation.

</div>

<hr>

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 5** 

A column in a pandas *DataFrame* can hold values of any type, including Python lists or dictionaries.

It is also possible to select a single column or a subset of the columns. In any case, when a single column is selected what is returned is a `pandas.Series` object. Instead when multiple columns are selected a `pandas.DataFrame` is returned.

- select the *UNIQUE_CARRIER_NAME* column. Use `df["column_name"]` for selecting a single column.
- select multiple columns by passing a list of column names instead of a single column name.

There is another option available for selecting specific rows. This is done by means of a conditional statement. 

- Check what `df["PASSENGERS"] > 0` returns and save it in a variable. Do you have a clue of what is going on under the hood?
- Then try to select only the rows which satisfy the above condition. Use the variable as the argument of the square bracket notation `df[]`.

</div>
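
Column selection and boolean filtering can be sketched on a few made-up rows. The carrier names and numbers below are invented for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "UNIQUE_CARRIER_NAME": ["Alpha Air", "Beta Jet", "Gamma Wings", "Delta Fly"],
    "PASSENGERS": [0, 120, 0, 45],
    "DISTANCE": [300, 500, 250, 800],
})

# A single column name returns a Series.
names = df_demo["UNIQUE_CARRIER_NAME"]

# A list of column names returns a DataFrame.
subset = df_demo[["UNIQUE_CARRIER_NAME", "PASSENGERS"]]

# The comparison is evaluated element-wise and yields a boolean Series.
mask = df_demo["PASSENGERS"] > 0

# Passing the boolean Series keeps only the rows where it is True.
with_passengers = df_demo[mask]

print(type(names).__name__, type(subset).__name__)
print(mask.tolist())
print(with_passengers)
```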

<hr>

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 6** 

It is possible to add a column to an existing *DataFrame*. This is useful if we want to compute some statistics or we need some custom filter criteria.

- assign a new column to an existing DataFrame. Use the syntax `df["new_column"] = ...` to assign a new object to a column. Sum the *PASSENGERS*, *FREIGHT* and *MAIL* columns and assign the summed values to a new column.
- check the results of the new column by getting some samples (try `df.sample(n_sample)`).
- get the *UNIQUE_CARRIER* values for which the value of the new column is greater than 10,000. Use `Series.unique()` to keep only unique values of a Series object.

Keep in mind that the assigned object should be either a `pandas.Series` or a `numpy.array` object (also python lists are possible). 

</div>
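
A possible sketch of adding a derived column and filtering on it. The data is made up and the column name `TOTAL_LOAD` is a hypothetical choice for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "UNIQUE_CARRIER": ["AA", "AA", "DL", "UA"],
    "PASSENGERS": [9000, 100, 4000, 20000],
    "FREIGHT": [2000, 50, 1000, 0],
    "MAIL": [500, 0, 100, 0],
})

# Column arithmetic is element-wise; the result is assigned as a new column.
df_demo["TOTAL_LOAD"] = df_demo["PASSENGERS"] + df_demo["FREIGHT"] + df_demo["MAIL"]

# Inspect a couple of random rows (random_state makes the draw repeatable).
sample = df_demo.sample(2, random_state=0)

# Carriers with at least one row above the threshold, without duplicates.
big_carriers = df_demo.loc[df_demo["TOTAL_LOAD"] > 10_000, "UNIQUE_CARRIER"].unique()

print(df_demo)
print(list(big_carriers))
```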

<hr>

## Group by operations
Like in SQL, it is possible to perform *group by* operations on a *DataFrame*. This is done with the `DataFrame.groupby` method, which accepts a single column or a list of columns. The returned object is a `DataFrameGroupBy`, which has methods to perform operations over the grouped entries. The API for the **GroupBy** objects can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) while the reference user-guide can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 7** 

- find the number of entries for each carrier. Use `DataFrameGroupBy.count()`.
- find the total distance for each carrier. Use `DataFrameGroupBy.sum()`.

Which carrier has the highest total distance?

</div>

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Pandas Tip 4**: the exercises here group by regular columns, so if the column you need is currently the index (for example after `df.set_index()` in Exercise 4), call `df.reset_index()` first to turn it back into a column. 

</div>
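
The group-by steps can be sketched on a handful of made-up rows. Selecting the `DISTANCE` column before aggregating is a choice made here to keep the output small.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "UNIQUE_CARRIER": ["AA", "DL", "AA", "UA", "DL", "AA"],
    "DISTANCE": [500, 300, 700, 1200, 400, 100],
})

grouped = df_demo.groupby("UNIQUE_CARRIER")

# Number of entries per carrier.
entries = grouped["DISTANCE"].count()

# Total distance per carrier.
total_distance = grouped["DISTANCE"].sum()

# Label of the group with the largest total.
top_carrier = total_distance.idxmax()

# If the grouping column is the index, reset_index() restores it as a column.
indexed = df_demo.set_index("UNIQUE_CARRIER")
same_total = indexed.reset_index().groupby("UNIQUE_CARRIER")["DISTANCE"].sum()

print(entries)
print(total_distance)
print(top_carrier)
```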

<hr>

## Combining and Merging
It is possible to combine data in pandas in different ways: 
1. `pandas.merge` connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
2. `pandas.concat` concatenates or “stacks” together objects along an axis.

A full guide to the options available for the two functions can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#joining-on-index). 

In [5]:
# preparing mock dataframe for next exercise
df_left = df.drop(columns=["ORIGIN", "DEST"])
df_right = df[["ORIGIN_AIRPORT_ID", "ORIGIN"]].drop_duplicates()
df_right_dest = df[["DEST_AIRPORT_ID", "DEST"]].drop_duplicates().sample(frac=0.7)
df_dropped = df[["ORIGIN", "DEST"]]

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 8** 

- reconcile the information about the origin airport name by merging the prepared left and right datasets with the `pd.merge` pandas API. Merge on **ORIGIN_AIRPORT_ID**.
- add the dropped columns *df_dropped* to the *df_left* dataframe in order to reconstruct the original one. Use the `pd.concat` pandas API, specifying the `axis` along which to perform the concatenation. 
- from the **df** dataframe create two new dataframes, one with the first 30 rows and one with the rows from index 60 to 90 (included), using the slicing operator seen earlier. Then build a new dataframe by concatenating the two.

</div>

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Pandas Tip 5**: the `pd.merge` function takes the left and right dataframes to be merged as parameters. In addition, the `on=<columns>` parameter specifies the key for the merge operation; if it is omitted, the column names shared by both dataframes are used.

The `how=<left|right|inner|outer>` parameter sets the merging strategy, which by default is *inner*. 
</div>

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **Pandas Tip 6**: some pandas operations like `df.min`, `df.max` or `pd.concat` accept an `axis` parameter that specifies the axis along which the operation is performed. There are two main axes in pandas: 

- axis 0: representing the rows (index)
- axis 1: representing the columns

</div>
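
The merge and concatenation options from the tips above can be sketched with two tiny made-up tables. The airport ids and codes below are invented pairs for illustration, not real BTS identifiers.

```python
import pandas as pd

left = pd.DataFrame({
    "ORIGIN_AIRPORT_ID": [1, 2, 1, 3],
    "PASSENGERS": [10, 20, 30, 40],
})
right = pd.DataFrame({
    "ORIGIN_AIRPORT_ID": [1, 2],
    "ORIGIN": ["JFK", "LAX"],
})

# Inner join (default): only keys present in both tables are kept.
inner = pd.merge(left, right, on="ORIGIN_AIRPORT_ID")

# Left join: every left row is kept; unmatched keys get NaN.
left_join = pd.merge(left, right, on="ORIGIN_AIRPORT_ID", how="left")

# axis=1 places the frames side by side, aligning on the row index.
extra = pd.DataFrame({"DISTANCE": [100, 200, 300, 400]})
side_by_side = pd.concat([left, extra], axis=1)

# axis=0 stacks the frames on top of each other.
stacked = pd.concat([left[:2], left[2:]], axis=0)

print(inner)
print(left_join)
print(side_by_side)
```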

<hr>

## Plotting results
Sometimes it is easier to understand results by visualizing them. A pandas *DataFrame* offers the `DataFrame.plot()` method for this purpose. The plotting library used by pandas is **matplotlib**. 

<div class="alert alert-info" role="alert">
    
<img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp;  **Exercise 9** 

- show the number of flights for the different months of the year.
- add the title and change the size of the plot.

</div>

<div class="alert alert-success" role="alert">
    
<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp; **MatplotLib Tip 1**: You can create a new figure with `f = plt.figure(figsize=[10, 5])` and add a subplot to it with `ax = f.add_subplot()`, which returns an *Axes* object. 

Please note:
- The subplot can be passed to `DataFrame.plot()` call. 
- The referenced *Axes* object can then be used to set properties of the plot such as the title.

For more references see the [official api](https://matplotlib.org/3.1.1/api/axes_api.html#axis-labels-title-and-legend) doc for Axes.

</div>
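
A minimal sketch of the figure and *Axes* workflow described in the tip. The monthly values are made up, and the `Agg` backend is selected only so the snippet also runs without a display; inside a notebook that line can be omitted.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumption for headless runs

import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly totals.
monthly = pd.DataFrame({
    "MONTH": [1, 2, 3, 4],
    "PASSENGERS": [120, 90, 150, 170],
})

# Create the figure with the desired size and add a single subplot.
f = plt.figure(figsize=[10, 5])
ax = f.add_subplot()

# Pass the Axes object to DataFrame.plot() so pandas draws on it.
monthly.plot(x="MONTH", y="PASSENGERS", kind="bar", ax=ax)

# Use the Axes object to set the title and axis labels.
ax.set_title("Passengers per month (made-up data)")
ax.set_xlabel("Month")
ax.set_ylabel("Passengers")
```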

## Homework

<div class="alert alert-danger" role="alert">
    
<img src="./icons/chemistry.png"  width="20" height="20" align="left"> &nbsp;  **Task 1**:  Plot the top 5 carriers based on their total distance.
[**SOLUTION**](./solutions/handson1/solution_1.py)

</div>

<div class="alert alert-danger" role="alert">
    
<img src="./icons/chemistry.png"  width="20" height="20" align="left"> &nbsp;  **Task 2**:  Plot the top 5 destinations based on the total number of passengers in April.
[**SOLUTION**](./solutions/handson1/solution_2.py)
</div>

<div class="alert alert-danger" role="alert">
    
<img src="./icons/chemistry.png"  width="20" height="20" align="left"> &nbsp;  **Task 3**:  Check the most crowded route (origin - destination) for *PASSENGERS*, *FREIGHT* and *MAIL*.
[**SOLUTION**](./solutions/handson1/solution_3.py)
</div>

<div class="alert alert-danger" role="alert">
    
<img src="./icons/chemistry.png"  width="20" height="20" align="left"> &nbsp;  **Task 4**:  Check if the number of outgoing passengers equals the total number of incoming passengers in all airports.
[**SOLUTION**](./solutions/handson1/solution_4.py)

</div>

<div hidden=True>
    <img src="./icons/list.png"  width="20" height="20" align="left"> &nbsp; Icon made by <a href="https://www.flaticon.com/authors/smashicons" title="Smashicons">Smashicons</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a>


<img src="./icons/lightbulb.png"  width="20" height="20" align="left"> &nbsp;Icon made by <a href="https://www.flaticon.com/authors/pixelmeetup" title="Pixelmeetup">Pixelmeetup</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a>

<img src="./icons/new.png"  width="20" height="20" align="left"> &nbsp; Icon made by <a href="https://www.flaticon.com/authors/pixel-perfect" title="Pixel perfect">Pixel perfect</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a>

<img src="./icons/chemistry.png"  width="20" height="20" align="left"> &nbsp; Icon made by <a href="https://www.flaticon.com/authors/popcorns-arts" title="Icon Pond">Icon Pond</a> from <a href="https://www.flaticon.com/"             title="Flaticon">www.flaticon.com</a></div>
