___

<a href='https://www.youtube.com/FallinPython'> <img src="_images/FallinPython_Jupyter-01.jpg" width="750" height="400" align="center"/></a>
___

In [1]:
import pandas as pd
import numpy as np

# Pandas Library
* Website:       https://pandas.pydata.org/ 
* Install Pandas:       https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
* Documentation: https://pandas.pydata.org/docs/
* User Guide: https://pandas.pydata.org/docs/user_guide/index.html#user-guide

## Installation using pip

<img src="_images/carbon_1.png" width="500" height="400" align="left" />
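Once the installation has finished, a quick sanity check is to import pandas and print its version. This is only a sketch; the version string you see depends on your own environment.

```python
import pandas as pd

# prints the installed pandas version as a string
print(pd.__version__)
```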

# 1. Pandas Data Structures

The most important data structures in the Pandas library are `Series` and `DataFrame`.<br>
It is good practice to ask yourself whether the command you are about to type in Pandas will return a `Series` or a `DataFrame`.<br>
Each of them has its own methods and attributes, and if you want to get comfortable with Pandas you need to learn what each of these data structures is able to do.

* Pandas Series: 1-Dimensional
* Pandas DataFrame: 2-Dimensional

<img src="_images/pandas_data_structures.png" width="1000" height="400" align="center" /><br>
___

## 1.1. Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). <br>
A Pandas Series behaves a bit like a `numpy array` and a bit like a `python dictionary`.
* Series as NumPy array: Pandas is built on top of NumPy for high-performance array computing. We can get and set values in a Series by integer position.
* Series as Python Dictionary: We can also get and set values by index label.
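A minimal sketch of these two behaviours, using a made-up `prices` Series (the name and values are only for illustration). Positional access goes through `.iloc` here because recent pandas versions deprecate plain integer keys on a label index.

```python
import pandas as pd

# Hypothetical data, for illustration only
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"])

# NumPy-like: access by integer position and vectorised arithmetic
first = prices.iloc[0]            # 1.5
doubled = prices * 2              # every element multiplied by 2

# Dictionary-like: access by label, membership test, list of keys
banana = prices["banana"]         # 2.0
has_cherry = "cherry" in prices   # True (checks the index, not the values)
labels = list(prices.keys())      # ['apple', 'banana', 'cherry']
```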

### 1.1.1 Creating Series

It's important to notice that <u>it is not very common</u> to create a `Series` manually in Pandas. More often you load the data into Pandas and manipulate it, ending up with either a `DataFrame` or a `Series`.

<img src="_images/common_usage_pandas.png" width="750" height="400" align="center" />

The basic way to create a Series is to call the `Series` constructor from Pandas:

```python
pd.Series(
    data=None,
    index=None,
    dtype=None,
    name=None,
    copy=False,
    fastpath=False,
)
```

In [2]:
my_series = pd.Series([100,200,300], index=["Monday", "Tuesday","Thursday"],name="Sales")
my_series

Monday      100
Tuesday     200
Thursday    300
Name: Sales, dtype: int64

We can provide different data types to the data parameter to create a pandas `Series`:
* List or tuple
* Numpy Array
* Scalar value
* Dictionary

**Creating Series from List**

In [100]:
data = [10, 20, 30, 40, 50]
my_series = pd.Series(data, index=list('abcde'))  # , index=[x for x in 'ABCDE']  or index=list('ABCDE')
my_series

a    10
b    20
c    30
d    40
e    50
dtype: int64

**Creating Series from a NumPy Array**

In [4]:
data = np.arange(10,60,10)
my_series = pd.Series(data)
print(my_series)

0    10
1    20
2    30
3    40
4    50
dtype: int32


**Creating Series from a Scalar Value**

In [5]:
index_list = ["a", "b", "c", "d", "e"]
my_series = pd.Series(13, index=index_list)   # , name="scalar value" (numpy broadcasting)
print(my_series)

a    13
b    13
c    13
d    13
e    13
dtype: int64


**Creating Series from a Dictionary**

In [6]:
data = {"D": 10, "B": 20, "C": 30, "A": 40, "E": 50}
my_series = pd.Series(data,name="Values")
print(my_series)

D    10
B    20
C    30
A    40
E    50
Name: Values, dtype: int64


### 1.1.2 Basic Operations on Series 

#### Accessing / setting elements

In [7]:
my_dict = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50}
my_series = pd.Series(my_dict)
print(my_series)

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [51]:
# retrieving a single element => using index (position)
# note: recent pandas versions deprecate integer keys on a label index; prefer my_series.iloc[0]
my_series[0]

10

In [54]:
# retrieving a single element => using index label
my_series["a"]

10

In [10]:
# retrieving n-elements => using index
my_series[[0,3,-1]]

a    10
d    40
e    50
dtype: int64

In [11]:
# retrieving n-elements => using index label
my_series[["a","d","e"]]

a    10
d    40
e    50
dtype: int64

#### Slicing pandas series

In [65]:
my_series

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [69]:
# slicing using index => (numpy-like)
my_series[0:3]=[10,20,30]
my_series

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [71]:
# slicing using index label => (dictionary-like)
my_series["a":"c"]=[3,2,1]
my_series

a     3
b     2
c     1
d    40
e    50
dtype: int64

**Important**:<br>
There is a better way to select elements of a pandas `Series`, and we will see it in the **Series Methods and Attributes** section. The good thing about learning it now is that we will also use it with pandas `DataFrame`. I am talking about the `.loc` and `.iloc` indexers!
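As a small preview (a sketch with made-up data), note how position-based and label-based slices treat the end of the slice differently: a position slice excludes the end, while a label slice includes it.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

by_position = s.iloc[0:3]    # positions 0, 1, 2 (end position excluded)
by_label = s.loc["a":"c"]    # labels a, b, c (end label included)
```

Both slices return the elements labelled `a`, `b` and `c`, even though the end points look different.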

#### Boolean Filter

In [72]:
my_array = np.arange(10,55,5)
my_series = pd.Series(my_array)
print(my_series)

0    10
1    15
2    20
3    25
4    30
5    35
6    40
7    45
8    50
dtype: int32


In [15]:
# single condition
filter_ = (my_series > 30)
my_series[filter_]

5    35
6    40
7    45
8    50
dtype: int32

In [16]:
# multiple conditions with and operator
filter_ = (my_series > 20) & (my_series <= 40)
my_series[filter_]

3    25
4    30
5    35
6    40
dtype: int32

In [17]:
# multiple condition with or operator
filter_ = (my_series < 20) | (my_series >= 40)
my_series[filter_]

0    10
1    15
6    40
7    45
8    50
dtype: int32
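Two details are worth a short sketch (same data as above): a filter can be negated with `~`, and Python's plain `and` / `or` keywords cannot replace `&` / `|`, because they need a single `True` or `False` rather than a whole `Series`. The variable names below are only for illustration.

```python
import numpy as np
import pandas as pd

my_series = pd.Series(np.arange(10, 55, 5))

# negation with ~ : keep everything that is NOT greater than 30
not_gt_30 = my_series[~(my_series > 30)]

# `and` asks for the truth value of a whole Series, which is ambiguous
try:
    (my_series > 20) and (my_series <= 40)
    used_and_ok = True
except ValueError:
    used_and_ok = False
```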

### 1.1.3 Arithmetic operations on Series 

Similar to a NumPy array, you can perform arithmetic operations on a single pandas `Series` and even between `Series`.

#### Arithmetic Operation on a single Series

In [78]:
series_1 = pd.Series([10,20,30,40,50])
series_1

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [83]:
series_1**5

0       100000
1      3200000
2     24300000
3    102400000
4    312500000
dtype: int64

#### Arithmetic Operation between Series

Arithmetic operations happen element-wise, as Pandas is built on top of NumPy.
The detail you need to pay attention to is whether the series have the same index:

<img src="_images/series_arithmetic_same_index.png" width="550" height="400" align="left" />

In [87]:
series_1 = pd.Series([10,20,30,40,50])
series_2 = pd.Series([30,70,100,120,150])

In [94]:
series_1*series_2

0     300
1    1400
2    3000
3    4800
4    7500
dtype: int64

<img src="_images/series_arithmetic_different_index.png" width="550" height="400" align="left" />

In [95]:
series_1 = pd.Series([10,20,30,40,50])
series_2 = pd.Series([30,70,100])

In [98]:
series_1+series_2

0     40.0
1     90.0
2    130.0
3      NaN
4      NaN
dtype: float64

**Important**:<br>
If you perform an arithmetic operation between series that have different indexes, you will end up with `NaN` (Not a Number) wherever there is no matching index label between the `Series`.
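A short sketch of two consequences of this behaviour. First, the `add` method accepts a `fill_value` that is used where one side has no entry, which avoids the `NaN`. Second, the matching is done on the labels, not on the order of the elements. The `left` and `right` series are made up for illustration.

```python
import pandas as pd

series_1 = pd.Series([10, 20, 30, 40, 50])
series_2 = pd.Series([30, 70, 100])

# treat missing entries as 0 instead of producing NaN
total = series_1.add(series_2, fill_value=0)

# alignment uses the labels, not the order of the elements
left = pd.Series([1, 2], index=["a", "b"])
right = pd.Series([10, 20], index=["b", "a"])
aligned = left + right    # a: 1 + 20, b: 2 + 10
```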

### 1.1.4 Series Methods and Attributes - (Good to know)

* Check them out in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/series.html

Pandas `Series` has plenty of methods and attributes. I will not go through all of them, but I will point out some useful ones that will help us during this course. You can check the complete list using the documentation link above or the Python built-in function `dir`.

In [24]:
dir(pd.Series([1,2,3]))

['T',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pos__',
 ...]

In [25]:
len([x for x in dir(pd.Series([1,2,3])) if not x.startswith("__")])

330

**How to get the index?**

In [26]:
my_dict = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50}
my_series = pd.Series(my_dict)
print(my_series)

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [27]:
my_series.index.to_list()

['a', 'b', 'c', 'd', 'e']

In [28]:
# how to rename the index (all at once)
my_series.index = ["Brazil","Germany","France","Belgium","Marocco"]
my_series

Brazil     10
Germany    20
France     30
Belgium    40
Marocco    50
dtype: int64

In [29]:
# how to rename a single or some indices
#my_series.index[0] = 'England'  # does not work: an Index is immutable (raises TypeError)
my_series.rename({"Brazil":"England", "Belgium":"USA"}, inplace=True)
my_series

England    10
Germany    20
France     30
USA        40
Marocco    50
dtype: int64

**How to get and set values?**

In [30]:
# accessing the values from a series
my_series.to_numpy()

array([10, 20, 30, 40, 50], dtype=int64)

In [31]:
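# you can set values in a Series (by position or by label)
# note: recent pandas versions deprecate integer keys on a label index; prefer my_series.iloc[0] = 1000
my_series[0] = 1000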
my_series["France"] = 500
my_series

England    1000
Germany      20
France      500
USA          40
Marocco      50
dtype: int64

**Indexers: `.loc[]` and `.iloc[]`**

In [32]:
# creating a series from a dictionary
my_dict = {"Brazil": 6, "Germany": 7, "France": 8, "Belgium": 9, "Marocco": 10}
series_1 = pd.Series(my_dict)
series_1

Brazil      6
Germany     7
France      8
Belgium     9
Marocco    10
dtype: int64

In [33]:
# .iloc selects elements by integer position
series_1.iloc[[0,2,-1]]

Brazil      6
France      8
Marocco    10
dtype: int64

In [34]:
# .loc selects elements by index label
series_1.loc[["Brazil","France","Marocco"]]  

Brazil      6
France      8
Marocco    10
dtype: int64

In [35]:
# possible to combine .loc with boolean filter
series_1.loc[series_1 > 7]

France      8
Belgium     9
Marocco    10
dtype: int64

In [36]:
# setting values using .loc with a boolean filter
series_1.loc[series_1 > 7] = 10
series_1

Brazil      6
Germany     7
France     10
Belgium    10
Marocco    10
dtype: int64

**Methods: `count()` and `value_counts()`**

In [37]:
import random

#let's create a series with n Elements
n = 500
countries = ["Brazil", "Germany", "France", "Belgium", "Marocco"]
list_countries = [random.choice(countries) for count in range(n)]
series_1 = pd.Series(list_countries)
series_1

0      Belgium
1       France
2       Brazil
3      Marocco
4      Belgium
        ...   
495     Brazil
496    Germany
497    Germany
498    Belgium
499    Belgium
Length: 500, dtype: object

In [38]:
# method: count()
# Question: How many entries are there?
series_1.count()

500

In [39]:
# method: value_counts()
# Question: How many times does each country appear?
series_1.value_counts(normalize=False)

Belgium    125
Germany    102
Brazil     102
France      87
Marocco     84
dtype: int64
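The `normalize` parameter used above switches between absolute counts and relative frequencies. A minimal sketch with a small, fixed sample (made up here so that the result is reproducible, unlike the random data above):

```python
import pandas as pd

# small fixed sample, for illustration only
sample = pd.Series(["Brazil", "Germany", "Brazil", "France", "Brazil"])

counts = sample.value_counts()                 # absolute counts
shares = sample.value_counts(normalize=True)   # fractions that sum to 1

print(counts["Brazil"])   # 3
print(shares["Brazil"])   # 0.6
```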

**Methods: `nunique()` and `unique()`**

In [40]:
series_1

0      Belgium
1       France
2       Brazil
3      Marocco
4      Belgium
        ...   
495     Brazil
496    Germany
497    Germany
498    Belgium
499    Belgium
Length: 500, dtype: object

In [41]:
# method: nunique()
# Question: How many unique countries are there?
series_1.nunique()

5

In [42]:
# method: unique()
# Question: What are the unique countries?
series_1.unique()

array(['Belgium', 'France', 'Brazil', 'Marocco', 'Germany'], dtype=object)

**Method: `describe()`**

In [43]:
# this method works for both: categorical and numerical data
series_1.describe()

count         500
unique          5
top       Belgium
freq          125
dtype: object
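For comparison, here is a sketch of `describe()` on numerical data (the `numbers` series is made up for illustration). Instead of `unique`, `top` and `freq`, the summary then contains statistics such as the mean, the standard deviation and the quartiles.

```python
import pandas as pd

numbers = pd.Series([10, 20, 30, 40, 50])
summary = numbers.describe()

print(list(summary.index))   # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary["mean"])       # 30.0
print(summary["50%"])        # 30.0 (the median)
```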

**Methods: `min()`, `max()`, `idxmin()`, `idxmax()`, `sum()`, `cumsum()`, `mean()`**

In [44]:
my_dict = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50}
my_series = pd.Series(my_dict)
print(my_series)

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [45]:
# methods: min(), idxmin()
# Question: What is the minimum value and its index label?
print(my_series.min())
print(my_series.idxmin())

10
a


In [46]:
# method: min(), max(), sum(), mean()
print(my_series.min())
print(my_series.max())
print(my_series.mean())
print(my_series.sum())

10
50
30.0
150


In [47]:
my_series

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [48]:
# method: cumsum()
my_series.cumsum()

a     10
b     30
c     60
d    100
e    150
dtype: int64

There are many more `Series` methods, and we will get to know more of them during this course.