<a href="https://colab.research.google.com/github/vicente-gonzalez-ruiz/YAPT/blob/master/scientific_computation/pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Pandas](http://pandas.pydata.org/)
Pandas provides high-performance data structures (typically used to hold "datasets" in machine learning) and data analysis tools for the Python programming language, comparable to those of [R](https://en.wikipedia.org/wiki/R_(programming_language)). Some of these tools are:
1. [Statistical functions (covariance, correlation)](http://pandas.pydata.org/pandas-docs/stable/computation.html#statistical-functions).
2. [Window functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions).
3. [Time series](http://pandas.pydata.org/pandas-docs/stable/timeseries.html).
4. [Analysis of sparse data](http://pandas.pydata.org/pandas-docs/stable/sparse.html).
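
As a minimal sketch of the first two items in the list, the statistical and window functions can be tried on two short Series. The `distance_km` values are made up for illustration only; `Series.corr`, `Series.cov` and `Series.rolling` are the pandas methods being shown.

```python
import pandas as pd

# Step counts as used later in this notebook; distances are hypothetical values.
steps = pd.Series([3620, 7891, 9761, 3907, 4338, 5373], name="steps")
distance_km = pd.Series([2.5, 5.6, 7.0, 2.8, 3.1, 3.9], name="distance_km")

# Statistical functions: Pearson correlation and covariance between two Series.
correlation = steps.corr(distance_km)
covariance = steps.cov(distance_km)

# Window function: mean over a sliding window of 3 consecutive values.
# The first two positions have no complete window, so they are NaN.
rolling_mean = steps.rolling(window=3).mean()

print(f"correlation = {correlation:.3f}")
print(f"covariance  = {covariance:.1f}")
print(rolling_mean)
```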

## Table of Contents:
1. [Install](#install)
2. [Series](#series)
3. [DataFrame](#dataframe)
4. [Indexation](#indexation)
5. [Concatenation](#concatenation)
6. [Aggregation](#agregation)
7. [Find and replace](#find_and_replace)
8. [Vectorization](#vectorization)

## 1. Install <a class="anchor" id="install"></a>

In [2]:
!pip3 install pandas

Collecting pandas
  Downloading pandas-2.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.0.3 pytz-2023.3 tzdata-2023.3


In [3]:
import pandas as pd

## 2. Series <a class="anchor" id="series"></a>

In [5]:
step_data = [3620, 7891, 9761, 3907, 4338, 5373]
step_counts = pd.Series(step_data, name='steps')
print(step_counts)

0    3620
1    7891
2    9761
3    3907
4    4338
5    5373
Name: steps, dtype: int64
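
A Series can also carry meaningful labels instead of the default integer positions. The sketch below assumes the six step counts belong to six consecutive days starting on 2015-03-29 (the same date range used in the Indexation section) and shows a few common reductions and a boolean filter.

```python
import pandas as pd

step_data = [3620, 7891, 9761, 3907, 4338, 5373]

# Assumption: one measurement per day, starting on 2015-03-29.
days = pd.date_range("2015-03-29", periods=6)
step_counts = pd.Series(step_data, index=days, name="steps")

total = step_counts.sum()                      # sum of all values
best_day = step_counts.idxmax()                # index label of the largest value
above_5000 = step_counts[step_counts > 5000]   # keep only the days above 5000 steps

print(total)
print(best_day)
print(above_5000)
```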


## 3. DataFrame <a class="anchor" id="dataframe"></a>

In [6]:
df = pd.DataFrame({'int_col' : [1, 2, 6, 8, -1],
                    'float_col' : [0.1, 0.2, 0.2, 10.1, None],
                    'str_col' : ['a', 'b', None, 'c', 'a']})
print(df)

   int_col  float_col str_col
0        1        0.1       a
1        2        0.2       b
2        6        0.2    None
3        8       10.1       c
4       -1        NaN       a


In [8]:
# zip() stops at the shorter sequence, so only the first six distances are paired with step_data
cycling_distances = [10.7, 0, None, 2.4, 15.3, 10.9, 0, None]
joined_data = list(zip(step_data, cycling_distances))
activity_df = pd.DataFrame(joined_data)
print(activity_df)

      0     1
0  3620  10.7
1  7891   0.0
2  9761   NaN
3  3907   2.4
4  4338  15.3
5  5373  10.9
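
Both DataFrames above contain missing entries (`None` is shown as `NaN` in numeric columns). A short sketch of how such gaps might be counted, filled or dropped follows; the fill values `0.0` and `"missing"` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({"int_col": [1, 2, 6, 8, -1],
                   "float_col": [0.1, 0.2, 0.2, 10.1, None],
                   "str_col": ["a", "b", None, "c", "a"]})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Replace missing values column by column (fill values chosen arbitrarily).
filled = df.fillna({"float_col": 0.0, "str_col": "missing"})

# Alternatively, keep only the rows without any missing value.
complete_rows = df.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```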


## 4. Indexation <a class="anchor" id="indexation"></a>

### Modify the index

In [73]:
activity_df = pd.DataFrame(joined_data, index=pd.date_range('20150329', periods=6), columns=['Walking','Cycling'])
activity_df

            Walking  Cycling
2015-03-29     3620     10.7
2015-03-30     7891      0.0
2015-03-31     9761      NaN
2015-04-01     3907      2.4
2015-04-02     4338     15.3
2015-04-03     5373     10.9


### Extract one row

In [74]:
activity_df.loc['2015-04-01']

Walking    3907.0
Cycling       2.4
Name: 2015-04-01 00:00:00, dtype: float64

### Extract one column

In [75]:
activity_df['Walking']

2015-03-29    3620
2015-03-30    7891
2015-03-31    9761
2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: Walking, dtype: int64

In [76]:
activity_df.iloc[:,0]

2015-03-29    3620
2015-03-30    7891
2015-03-31    9761
2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: Walking, dtype: int64
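
The two accessors differ in how they slice: `.loc` works with labels and includes both endpoints, while `.iloc` works with integer positions and excludes the end position. The sketch below rebuilds the same `activity_df` so that it runs on its own, and adds a boolean-mask selection; the threshold of 10 is an arbitrary value for illustration.

```python
import pandas as pd

activity_df = pd.DataFrame(
    {"Walking": [3620, 7891, 9761, 3907, 4338, 5373],
     "Cycling": [10.7, 0.0, None, 2.4, 15.3, 10.9]},
    index=pd.date_range("2015-03-29", periods=6),
)

# Label-based slice: both endpoints are included (3 rows).
by_label = activity_df.loc["2015-03-30":"2015-04-01", "Walking"]

# Position-based slice: the end position is excluded (2 rows).
by_position = activity_df.iloc[1:3, 0]

# Boolean mask: rows where the cycling distance exceeds 10 (NaN compares as False).
long_rides = activity_df.loc[activity_df["Cycling"] > 10, ["Cycling"]]

print(by_label)
print(by_position)
print(long_rides)
```

A single row mixes the integer `Walking` value with the float `Cycling` value, so pandas returns it as a `float64` Series; this is why 3907 appears as 3907.0 in the row extracted above.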

### Load the Iris dataset

In [77]:
!pip install scikit-learn


In [78]:
from sklearn import datasets
import numpy as np

In [79]:
iris = datasets.load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [80]:
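# np.c_ appends the target as an extra column of the (float) data array, so 'target' is stored as float
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])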

In [82]:
iris_df.iloc[:5]

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0
2                4.7               3.2                1.3               0.2     0.0
3                4.6               3.1                1.5               0.2     0.0
4                5.0               3.6                1.4               0.2     0.0


## 5. Concatenation <a class="anchor" id="concatenation"></a>

### Example 1

In [64]:
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df1 = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
df2 = pd.DataFrame(data=iris['target'], columns=['target'])
print(df1, df2)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]      target
0         0
1         0
2   

In [65]:
df1

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3


In [66]:
df2

     target
0         0
1         0
2         0
3         0
4         0
..      ...
145       2
146       2
147       2
148       2


In [63]:
concatenated_df = pd.concat([df1, df2], axis=1)
concatenated_df

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                  5.1               3.5                1.4               0.2       0
1                  4.9               3.0                1.4               0.2       0
2                  4.7               3.2                1.3               0.2       0
3                  4.6               3.1                1.5               0.2       0
4                  5.0               3.6                1.4               0.2       0
..                 ...               ...                ...               ...     ...
145                6.7               3.0                5.2               2.3       2
146                6.3               2.5                5.0               1.9       2
147                6.5               3.0                5.2               2.0       2
148                6.2               3.4                5.4               2.3       2
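
`pd.concat` can join frames in either direction: `axis=1` places them side by side (aligning on the index, as in the example above), while `axis=0` stacks them vertically. The following sketch uses three tiny hypothetical frames so that it runs without scikit-learn.

```python
import pandas as pd

# Small hypothetical frames, used only to illustrate the two directions.
features = pd.DataFrame({"length": [5.1, 4.9], "width": [3.5, 3.0]})
labels = pd.DataFrame({"target": [0, 0]})
more_features = pd.DataFrame({"length": [6.7], "width": [3.0]})

# axis=1: add columns, matching rows by index label.
side_by_side = pd.concat([features, labels], axis=1)

# axis=0: append rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([features, more_features], axis=0, ignore_index=True)

print(side_by_side)
print(stacked)
```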


### Example 2

In [69]:
# Append a derived column: sepal area = sepal width x sepal length
iris_df['sepal_area'] = iris_df['sepal width (cm)'] * iris_df['sepal length (cm)']
iris_df[['target', 'sepal_area']].head()

   target  sepal_area
0     0.0       17.85
1     0.0       14.70
2     0.0       15.04
3     0.0       14.26
4     0.0       18.00


## 6. Aggregation <a class="anchor" id="agregation"></a>

In [67]:
aggregated_df = df1.groupby(df2['target']).mean()

In [68]:
aggregated_df

        sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
target
0                   5.006             3.428              1.462             0.246
1                   5.936             2.770              4.260             1.326
2                   6.588             2.974              5.552             2.026
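
`groupby` is not limited to a single statistic. As a sketch, `agg` accepts a list of function names and returns one column per function; the small `measurements` frame below is hypothetical and stands in for the Iris data.

```python
import pandas as pd

# Hypothetical measurements standing in for the Iris data.
measurements = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.3, 5.2, 5.0],
})

# Split by species, then compute several statistics of one column per group.
summary = measurements.groupby("species")["petal_length"].agg(["mean", "min", "max"])

print(summary)
```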


## 7. Find and replace <a class="anchor" id="find_and_replace"></a>

In [85]:
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
iris_df['target'] = iris['target']
iris_df['species'] = iris_df['target'].map({0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'})

In [89]:
iris_df

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target         species
0                  5.1               3.5                1.4               0.2       0     Iris-setosa
1                  4.9               3.0                1.4               0.2       0     Iris-setosa
2                  4.7               3.2                1.3               0.2       0     Iris-setosa
3                  4.6               3.1                1.5               0.2       0     Iris-setosa
4                  5.0               3.6                1.4               0.2       0     Iris-setosa
..                 ...               ...                ...               ...     ...             ...
145                6.7               3.0                5.2               2.3       2  Iris-virginica
146                6.3               2.5                5.0               1.9       2  Iris-virginica
147                6.5               3.0                5.2               2.0       2  Iris-virginica
148                6.2               3.4                5.4               2.3       2  Iris-virginica


In [91]:
# Strip the 'Iris-' prefix from every species name with a vectorized string method
iris_df['abbrev'] = iris_df['species'].str.replace('Iris-', '', regex=False)
iris_df.iloc[:5, -3:]

   target      species  abbrev
0       0  Iris-setosa  setosa
1       0  Iris-setosa  setosa
2       0  Iris-setosa  setosa
3       0  Iris-setosa  setosa
4       0  Iris-setosa  setosa
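
Two similarly named methods behave differently here: `Series.str.replace` substitutes substrings inside each string, whereas `Series.replace` (without `regex=True`) only matches whole values. A small sketch of the difference follows; the replacement label `"I. setosa"` is a made-up value for illustration.

```python
import pandas as pd

species = pd.Series(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

# Substring replacement inside every string (regex=False treats the pattern literally).
short_names = species.str.replace("Iris-", "", regex=False)

# Whole-value replacement: only cells equal to the dictionary key are changed.
relabelled = species.replace({"Iris-setosa": "I. setosa"})

# No cell equals "Iris-" exactly, so this leaves the Series unchanged.
unchanged = species.replace("Iris-", "")

print(short_names)
print(relabelled)
print(unchanged)
```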


## 8. Vectorization <a class="anchor" id="vectorization"></a>

In [94]:
import pandas as pd
import numpy as np
data = {"x": 2**np.arange(5),
        "y": 3**np.arange(5),
        "z": np.array([45, 98, 24, 11, 64])}
index = ["a", "b", "c", "d", "e"]
df = pd.DataFrame(data=data, index=index)

In [95]:
df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


In [96]:
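# Build a boolean mask that selects the rows where z is below 50
mask = df["z"] < 50
# Assign through .loc in a single step; chained indexing (df["z"][mask] = 0) may modify a temporary copy
df.loc[mask, "z"] = 0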

In [97]:
df

    x   y   z
a   1   1   0
b   2   3  98
c   4   9   0
d   8  27   0
e  16  81  64
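
Vectorization means expressing an operation on whole columns instead of looping over rows in Python. The sketch below rebuilds the original frame and compares a row-by-row loop with the equivalent column expression. It also shows `Series.where` as a non-mutating alternative to the masked assignment above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": 2 ** np.arange(5),
                   "y": 3 ** np.arange(5),
                   "z": [45, 98, 24, 11, 64]},
                  index=["a", "b", "c", "d", "e"])

# Row-by-row version: correct, but slow on large frames (shown for comparison only).
looped = [row.x + row.y for row in df.itertuples()]

# Vectorized version: one expression over whole columns.
vectorized = df["x"] + df["y"]

# Keep z where the condition holds and use 0 elsewhere, without modifying df.
clipped = df["z"].where(df["z"] >= 50, 0)

print(looped)
print(vectorized)
print(clipped)
```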
