<a href="https://colab.research.google.com/github/tarsojabbes/data-visualization/blob/main/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pandas**

In [1]:
import numpy as np
import pandas as pd

## **Introduction to Pandas**

In [2]:
data = pd.Series([0.25,0.5,0.75,1], index=['a','b','c','d'])

In [3]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
data['b']

0.5

In [6]:
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64
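The slice above includes its end label `'c'`, which differs from ordinary Python slicing. The following is a minimal sketch of that difference, rebuilding the same Series and using `.loc` and `.iloc` to make the two modes explicit.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints are included.
by_label = data.loc['a':'c']

# Position-based slice: the stop position is excluded, as in plain Python.
by_position = data.iloc[0:2]

print(by_label.index.tolist())     # ['a', 'b', 'c']
print(by_position.index.tolist())  # ['a', 'b']
```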

## **Pandas Series**

In [7]:
grades_dict = {'A': 4, 'B': 3.5, 'C': 3, 'D': 2.5}
grades = pd.Series(grades_dict)

In [8]:
grades

A    4.0
B    3.5
C    3.0
D    2.5
dtype: float64

In [9]:
marks_dict = {'A': 85, 'B': 75, 'C': 65, 'D': 55}
marks = pd.Series(marks_dict)

In [10]:
marks['A']

85

## **Pandas DataFrame**

In [11]:
df = pd.DataFrame({'grades': grades, 'marks': marks})

In [12]:
df

|   | grades | marks |
|---|---|---|
| A | 4.0 | 85 |
| B | 3.5 | 75 |
| C | 3.0 | 65 |
| D | 2.5 | 55 |


In [13]:
df['scaled marks'] = np.floor(100*(df['marks']/90))

In [14]:
df

|   | grades | marks | scaled marks |
|---|---|---|---|
| A | 4.0 | 85 | 94.0 |
| B | 3.5 | 75 | 83.0 |
| C | 3.0 | 65 | 72.0 |
| D | 2.5 | 55 | 61.0 |


## **Missing Values**

In [15]:
missing_values = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

In [16]:
missing_values

|   | a | b | c |
|---|---|---|---|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |


In [17]:
missing_values.fillna(0)

|   | a | b | c |
|---|---|---|---|
| 0 | 1.0 | 2 | 0.0 |
| 1 | 0.0 | 3 | 4.0 |
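Filling everything with 0 is only one option. As a rough sketch, the same frame can be inspected with `isna()`, filled with a different default per column, or stripped of incomplete rows with `dropna()`. The fill values chosen here are arbitrary and only for illustration.

```python
import pandas as pd

missing_values = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Count the missing entries in each column.
n_missing = missing_values.isna().sum()

# Fill each column with its own default (illustrative values).
filled = missing_values.fillna({'a': 0, 'c': -1})

# Alternatively, drop every row that contains at least one NaN.
dropped = missing_values.dropna()

print(n_missing.to_dict())  # {'a': 1, 'b': 0, 'c': 1}
print(len(dropped))         # 0, because both rows have a missing value
```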


## **Pandas Indexing**

In [18]:
example = pd.Series(['a', 'b', 'c'], index=[1,7,5])

Explicit indexes are the labels we supply when creating the Series. With integer labels, plain `[]` looks values up by label; `.loc[]` does the same thing unambiguously.

In [21]:
print(example[1])
print(example[7])

a
b


Implicit indexes are the 0-based positions, as in ordinary array indexing; `.iloc[]` selects by position.

In [22]:
print(example.iloc[0:3])

1    a
7    b
5    c
dtype: object
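To make the contrast concrete, here is a small sketch that applies both accessors to the same Series. The number `1` means "label 1" for `.loc` but "second position" for `.iloc`.

```python
import pandas as pd

example = pd.Series(['a', 'b', 'c'], index=[1, 7, 5])

# .loc always uses the explicit labels given in `index`.
print(example.loc[1])   # a
print(example.loc[7])   # b

# .iloc always uses the implicit 0-based position.
print(example.iloc[1])  # b  (second element, whose label is 7)

# Slices differ too: label slices include the stop, positional ones do not.
print(example.loc[1:7].tolist())   # ['a', 'b']
print(example.iloc[1:3].tolist())  # ['b', 'c']
```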


## **Pandas Practice**

In [23]:
from google.colab import files
import io

uploaded = files.upload()

Saving data.csv to data.csv


In [24]:
data = pd.read_csv(io.BytesIO(uploaded['data.csv']), header=None)

In [25]:
data.columns = ['C' + str(x) for x in range(data.shape[1])]

In [26]:
data.head()

|   | C0 | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


Changing the annual income label from '>50K' and '<=50K' to 1 and -1

In [27]:
label = data['C14'].unique()

In [28]:
idx = data['C14']==label[0]

In [29]:
data.loc[idx, 'C14'] = -1  # one .loc call on the DataFrame avoids the chained-assignment warning


In [30]:
data.loc[~idx, 'C14'] = 1  # one .loc call on the DataFrame avoids the chained-assignment warning
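The two relabeling steps above can also be collapsed into one vectorized assignment. The sketch below uses `np.where` on a tiny made-up stand-in for the income column (the three rows are hypothetical, not taken from `data.csv`); the mask logic mirrors the `label` and `idx` cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the income column of the real dataset.
toy = pd.DataFrame({'C14': ['<=50K', '>50K', '<=50K']})

label = toy['C14'].unique()
idx = toy['C14'] == label[0]  # True where the row carries the first-seen label

# Replace the whole column at once: -1 where idx is True, 1 elsewhere.
toy['C14'] = np.where(idx, -1, 1)

print(toy['C14'].tolist())  # [-1, 1, -1]
```

Because the column is reassigned on the DataFrame itself, no intermediate slice is modified.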


Converting the job type from a single string column into one indicator (dummy) column per category

In [31]:
data['C1'].unique().size # We have 9 different job types

9

In [32]:
data = pd.get_dummies(data, columns = ['C1', 'C14'])

The C1 column is dropped and replaced by 9 indicator columns named C1_<job type>; likewise, C14 is replaced by C14_-1 and C14_1.

In [33]:
data

|   | C0 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C1_ ? | C1_ Federal-gov | C1_ Local-gov | C1_ Never-worked | C1_ Private | C1_ Self-emp-inc | C1_ Self-emp-not-inc | C1_ State-gov | C1_ Without-pay | C14_-1 | C14_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 50 | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 38 | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 53 | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 28 | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 32557 | 40 | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 32558 | 58 | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 32559 | 22 | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
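The wide table above can be hard to read, so here is the same one-hot encoding on a tiny made-up frame. The column names `job` and `age` and the three rows are hypothetical; `dtype=int` is passed so that the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical miniature of the dataset: one categorical and one numeric column.
toy = pd.DataFrame({'job': ['Private', 'State-gov', 'Private'],
                    'age': [38, 39, 53]})

# The encoded column is dropped and replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=['job'], dtype=int)

print(encoded.columns.tolist())        # ['age', 'job_Private', 'job_State-gov']
print(encoded['job_Private'].tolist()) # [1, 0, 1]
```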


## **Pandas GroupBy**

In [34]:
dataframe = pd.DataFrame({'ProductName': ['Bulb', 'Bulb', 'Fan', 'Fan'], 
                          'Type': ['A', 'B', 'A', 'A'],
                          'EC': [400., 300., 250.,300.]})

In [40]:
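dataframe.groupby('Type')[['EC']].sum()  # select EC explicitly; recent pandas versions would otherwise also "sum" (concatenate) the ProductName strings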

| Type | EC |
|---|---|
| A | 950.0 |
| B | 300.0 |


In [42]:
dataframe.groupby(['ProductName', 'Type']).sum()

| ProductName | Type | EC |
|---|---|---|
| Bulb | A | 400.0 |
| Bulb | B | 300.0 |
| Fan | A | 550.0 |
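`sum()` is only one of the available reductions. As a small sketch on the same frame, `agg` can compute several statistics per group in one call; the choice of `sum`, `mean`, and `count` here is just for illustration.

```python
import pandas as pd

dataframe = pd.DataFrame({'ProductName': ['Bulb', 'Bulb', 'Fan', 'Fan'],
                          'Type': ['A', 'B', 'A', 'A'],
                          'EC': [400., 300., 250., 300.]})

# Several aggregations of EC per product, one result column per function.
summary = dataframe.groupby('ProductName')['EC'].agg(['sum', 'mean', 'count'])

print(summary.loc['Bulb', 'sum'])   # 700.0
print(summary.loc['Fan', 'mean'])   # 275.0
```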


## **Pandas Hierarchical Indexing**

In [44]:
a = [['Bulb', 'Bulb', 'Bulb', 'Fan', 'Fan', 'Fan'],
     ['A', 'B', 'C', 'A', 'B', 'C']]

index = pd.MultiIndex.from_arrays(a, names=('ProductName', 'Type'))
df = pd.DataFrame({'EC': [20., 30, 40, 25, 10, 30]}, index=index)

In [45]:
df

| ProductName | Type | EC |
|---|---|---|
| Bulb | A | 20.0 |
| Bulb | B | 30.0 |
| Bulb | C | 40.0 |
| Fan | A | 25.0 |
| Fan | B | 10.0 |
| Fan | C | 30.0 |


This sums the energy consumption (EC) over the first index level, ProductName

In [46]:
df.groupby(level=0).sum()

| ProductName | EC |
|---|---|
| Bulb | 90.0 |
| Fan | 65.0 |


This sums the energy consumption (EC) over the second index level, Type

In [48]:
df.groupby(level=1).sum()

| Type | EC |
|---|---|
| A | 45.0 |
| B | 40.0 |
| C | 70.0 |
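Index levels can also be addressed by name instead of by number, which tends to be easier to read. The sketch below rebuilds the same MultiIndex frame, groups by level name, and additionally uses `unstack` to pivot the `Type` level into columns.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['Bulb', 'Bulb', 'Bulb', 'Fan', 'Fan', 'Fan'],
     ['A', 'B', 'C', 'A', 'B', 'C']],
    names=('ProductName', 'Type'))
df = pd.DataFrame({'EC': [20., 30, 40, 25, 10, 30]}, index=index)

# Same results as level=0 and level=1, but using the level names.
by_product = df.groupby(level='ProductName').sum()
by_type = df.groupby(level='Type').sum()

# Pivot the inner level into columns: one row per product, one column per type.
wide = df['EC'].unstack('Type')

print(by_product['EC'].tolist())  # [90.0, 65.0]
print(by_type['EC'].tolist())     # [45.0, 40.0, 70.0]
print(wide.loc['Fan', 'B'])       # 10.0
```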


## **Pandas Rolling**

In [50]:
df = pd.DataFrame({'A': np.random.randint(0,10,5),
                   'B': np.random.randint(0,10,5),
                   'C': np.random.randint(0,10,5)})

In [52]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 9 | 1 | 0 |
| 1 | 2 | 0 | 4 |
| 2 | 5 | 4 | 0 |
| 3 | 1 | 0 | 2 |
| 4 | 9 | 0 | 4 |


In [53]:
df.rolling(2,min_periods=1).sum()

|   | A | B | C |
|---|---|---|---|
| 0 | 9.0 | 1.0 | 0.0 |
| 1 | 11.0 | 1.0 | 4.0 |
| 2 | 7.0 | 4.0 | 4.0 |
| 3 | 6.0 | 4.0 | 2.0 |
| 4 | 10.0 | 0.0 | 6.0 |
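Since the frame above is random, the effect of the window is easier to follow on fixed numbers. This sketch reuses the values of column A from the printed output and compares `min_periods=1` with the default, which requires a full window.

```python
import pandas as pd

# Values of column A as printed above, fixed here so the result is reproducible.
s = pd.Series([9, 2, 5, 1, 9])

# Window of 2 rows; with min_periods=1 the first row is summed on its own.
with_min = s.rolling(2, min_periods=1).sum()

# Default min_periods equals the window size, so the first row becomes NaN.
strict = s.rolling(2).sum()

print(with_min.tolist())       # [9.0, 11.0, 7.0, 6.0, 10.0]
print(strict.isna().tolist())  # [True, False, False, False, False]
```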


## **Pandas Where**

In [54]:
df = pd.DataFrame(np.arange(10).reshape(5,2), columns=['A', 'B'])

In [55]:
df.where(df<5, -df) # Keep values where the condition is True; where it is False, use the replacement (-df)

|   | A | B |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | -5 |
| 3 | -6 | -7 |
| 4 | -8 | -9 |
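`where` replaces values where the condition is False, which can feel inverted. Its counterpart `mask` replaces values where the condition is True. A short sketch showing that the two calls below give the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'])

# where: keep entries with df < 5, replace the rest with -df.
kept = df.where(df < 5, -df)

# mask: replace entries with df >= 5 by -df, keep the rest.
masked = df.mask(df >= 5, -df)

print(kept.equals(masked))  # True
print(kept['B'].tolist())   # [1, 3, -5, -7, -9]
```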


## **Pandas Clip**

In [56]:
df = pd.DataFrame(np.random.randint(0,50,(5,10)), columns=list("ABDCEFGHIJ"))

In [57]:
df

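|   | A | B | D | C | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 28 | 42 | 1 | 28 | 12 | 33 | 7 | 20 | 37 |
| 1 | 13 | 23 | 22 | 3 | 7 | 10 | 1 | 26 | 44 | 12 |
| 2 | 9 | 3 | 14 | 43 | 27 | 22 | 17 | 20 | 6 | 11 |
| 3 | 40 | 32 | 3 | 19 | 27 | 37 | 11 | 10 | 16 | 25 |
| 4 | 40 | 26 | 47 | 36 | 16 | 44 | 26 | 37 | 21 | 8 |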


With clip we can set lower and upper bounds for numeric values. Any value smaller than the lower bound is raised to the lower bound, and any value larger than the upper bound is lowered to the upper bound; values in between are left unchanged.

In [58]:
df.clip(10, 30)

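|   | A | B | D | C | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 28 | 30 | 10 | 28 | 12 | 30 | 10 | 20 | 30 |
| 1 | 13 | 23 | 22 | 10 | 10 | 10 | 10 | 26 | 30 | 12 |
| 2 | 10 | 10 | 14 | 30 | 27 | 22 | 17 | 20 | 10 | 11 |
| 3 | 30 | 30 | 10 | 19 | 27 | 30 | 11 | 10 | 16 | 25 |
| 4 | 30 | 26 | 30 | 30 | 16 | 30 | 26 | 30 | 21 | 10 |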
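Either bound can also be given on its own. The sketch below uses the first four values of row 0 from the table above as a fixed Series and shows two-sided, lower-only, and upper-only clipping.

```python
import pandas as pd

# First four values of row 0 above, fixed so the result is reproducible.
s = pd.Series([3, 28, 42, 1])

both = s.clip(10, 30)          # bounded on both sides
lower_only = s.clip(lower=10)  # only small values are raised
upper_only = s.clip(upper=30)  # only large values are lowered

print(both.tolist())        # [10, 28, 30, 10]
print(lower_only.tolist())  # [10, 28, 42, 10]
print(upper_only.tolist())  # [3, 28, 30, 1]
```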


## **Pandas Merge**

In [61]:
df1 = pd.DataFrame({'E':[ 'B', 'G', 'L', 'S'],
                    'G': ['A', 'E', 'E', 'H']})
df2 = pd.DataFrame({'E': ['L', 'B', 'G', 'S'],
                    'H':[2004, 2008, 2012, 2018]})

In [62]:
pd.merge(df1, df2)

|   | E | G | H |
|---|---|---|---|
| 0 | B | A | 2008 |
| 1 | G | E | 2012 |
| 2 | L | E | 2004 |
| 3 | S | H | 2018 |


In [63]:
df3 = pd.merge(df1, df2)

In [64]:
df4 = pd.DataFrame({'G': ['A', 'E', 'H'],
                    'S': ['C', 'G', 'S']})

In [65]:
pd.merge(df3, df4, on="G")

|   | E | G | H | S |
|---|---|---|---|---|
| 0 | B | A | 2008 | C |
| 1 | G | E | 2012 | G |
| 2 | L | E | 2004 | G |
| 3 | S | H | 2018 | S |
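In the merges above every key has a partner in the other frame, so the default inner join loses nothing. As a sketch of what happens otherwise, the two small frames below (hypothetical data, with keys `G` and `X` deliberately unmatched) are merged with `how='outer'` and `indicator=True`, which records where each row came from.

```python
import pandas as pd

# Hypothetical frames: 'G' exists only on the left, 'X' only on the right.
left = pd.DataFrame({'E': ['B', 'G', 'L'], 'G': ['A', 'E', 'E']})
right = pd.DataFrame({'E': ['L', 'B', 'X'], 'H': [2004, 2008, 2020]})

# Inner join (the default): only keys present in both frames survive.
inner = pd.merge(left, right, on='E')

# Outer join: all keys are kept; _merge tells which side each row came from.
outer = pd.merge(left, right, on='E', how='outer', indicator=True)
origin = outer.set_index('E')['_merge'].astype(str).to_dict()

print(inner['E'].tolist())  # ['B', 'L']
print(origin['G'])          # left_only
print(origin['X'])          # right_only
```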
