## Why not Lists?

In [1]:
names = ['Ram', 'Rahim', 'Hari', 'Raju']
heights = [169, 171, 168, 173]

In [2]:
ind_hari = names.index('Hari')
heights[ind_hari]

168

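Looking up a value this way takes two steps and relies on the two lists staying in the same order, which is inconvenient for large datasets. Hence we use <b>Dictionaries</b>, which map each key directly to its value.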

In [3]:
persons = {'Ram': 169, 'Rahim': 171, 'Hari': 168, 'Raju': 173}

In [4]:
persons['Ram']

169

In [5]:
persons['Rahim']

171

In [6]:
# check the keys of the dictionary
persons.keys()

dict_keys(['Ram', 'Rahim', 'Hari', 'Raju'])
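Asking for a key that is not stored raises a `KeyError`. As a minimal sketch of two common ways around that, the cell below uses a membership test and `get()`; the name `'Shyam'` is a made-up key that is deliberately absent from the dictionary.

```python
persons = {'Ram': 169, 'Rahim': 171, 'Hari': 168, 'Raju': 173}

# the membership test checks the keys, not the values
has_hari = 'Hari' in persons        # key is present
has_shyam = 'Shyam' in persons      # 'Shyam' is a made-up key, not present

# get() returns a default instead of raising KeyError for a missing key
height_shyam = persons.get('Shyam')         # default is None
height_default = persons.get('Shyam', 0)    # explicit default

# items() yields (key, value) pairs in insertion order
for name, height in persons.items():
    print(name, height)
```

Whether `None` or an explicit default is the better fallback depends on how the missing value is used afterwards.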

## Dictionary Manipulations
+ adding new element
+ updating an existing element
+ removing an element

In [7]:
europe = {'spain':'madrid', 'france':'paris', 'germany':'munich', 'norway':'oslo' }

In [8]:
# adding italy to europe
europe['italy'] = 'rome'

In [9]:
europe

{'spain': 'madrid',
 'france': 'paris',
 'germany': 'munich',
 'norway': 'oslo',
 'italy': 'rome'}

In [10]:
# adding to europe
europe['poland'] = 'warsaw'

In [11]:
europe

{'spain': 'madrid',
 'france': 'paris',
 'germany': 'munich',
 'norway': 'oslo',
 'italy': 'rome',
 'poland': 'warsaw'}

In [12]:
# updating capital of germany
europe['germany'] = 'berlin'

In [13]:
europe

{'spain': 'madrid',
 'france': 'paris',
 'germany': 'berlin',
 'norway': 'oslo',
 'italy': 'rome',
 'poland': 'warsaw'}

In [14]:
capitals = ['delhi', 'dhaka', 'munich', 'colombo']  # list updating vs. the dictionary updating above

In [15]:
capitals[2] = 'berlin'

In [16]:
capitals

['delhi', 'dhaka', 'berlin', 'colombo']

In [17]:
europe

{'spain': 'madrid',
 'france': 'paris',
 'germany': 'berlin',
 'norway': 'oslo',
 'italy': 'rome',
 'poland': 'warsaw'}

In [18]:
# removing norway
del europe['norway']

In [19]:
europe

{'spain': 'madrid',
 'france': 'paris',
 'germany': 'berlin',
 'italy': 'rome',
 'poland': 'warsaw'}
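The same three manipulations can also be done with dictionary methods. The following is a small sketch, not a replacement for the cells above; the `'sweden'` entry is an assumption added only for illustration.

```python
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin',
          'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}

# pop() removes a key and returns its value
removed = europe.pop('norway')

# a second argument avoids a KeyError when the key is already gone
missing = europe.pop('norway', None)

# update() adds new keys and overwrites existing ones in one call
# ('sweden' is an illustrative entry, not part of the original data)
europe.update({'sweden': 'stockholm', 'germany': 'berlin'})
```

Unlike `del`, `pop()` hands back the removed value, which is handy when the value is still needed.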

## Nested Dictionaries

In [20]:
# dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }

In [21]:
europe

{'spain': {'capital': 'madrid', 'population': 46.77},
 'france': {'capital': 'paris', 'population': 66.03},
 'germany': {'capital': 'berlin', 'population': 80.62},
 'norway': {'capital': 'oslo', 'population': 5.084}}

In [22]:
type(europe)

dict

In [23]:
# to select data from nested dictionaries we use chained square brackets
europe['spain']['population']

46.77

In [25]:
europe['spain']

{'capital': 'madrid', 'population': 46.77}
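Chained brackets work for one value at a time. To work across all countries, one option is to loop over the outer dictionary, as in this sketch; the lookup of `'italy'` is an assumed example of a key that is not stored.

```python
europe = {'spain': {'capital': 'madrid', 'population': 46.77},
          'france': {'capital': 'paris', 'population': 66.03},
          'germany': {'capital': 'berlin', 'population': 80.62},
          'norway': {'capital': 'oslo', 'population': 5.084}}

# loop over the outer dictionary; each value is itself a dictionary
capitals = {country: info['capital'] for country, info in europe.items()}

# find the country with the largest population value
largest = max(europe, key=lambda country: europe[country]['population'])

# chained get() calls avoid a KeyError for a country that is not stored
italy_capital = europe.get('italy', {}).get('capital')
```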

## Pandas

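Pandas is a high-level data manipulation tool built on top of the NumPy package.

Although we can create multi-dimensional arrays using NumPy, we do not use NumPy itself as a data manipulation tool. Why? A NumPy array holds a single data type, whereas real datasets usually mix types (text, numbers, booleans) across columns. A Pandas DataFrame allows a different type for each column.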
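One way to see the difference is to put text and numbers into both structures. This is a minimal sketch with two made-up rows, not a benchmark or a full comparison.

```python
import numpy as np
import pandas as pd

# a NumPy array holds one data type, so mixed values are coerced to strings
mixed = np.array(['Brazil', 8.516])

# a DataFrame keeps a separate data type for each column
df = pd.DataFrame({'Country': ['Brazil', 'Russia'], 'Area': [8.516, 17.01]})

print(mixed.dtype.kind)   # 'U' stands for a unicode string type
print(df.dtypes)

# the numeric column stays numeric, so arithmetic works directly
total_area = df['Area'].sum()
```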

### Creating DataFrames using Pandas

In [26]:
data = {
    'Country': ['Brazil', 'Russia', 'India', 'China', 'South Africa'],
    'Capital': ['Brasilia', 'Moscow', 'New Delhi', 'Beijing', 'Pretoria'],
    'Area': [8.516, 17.01, 3.286, 9.597, 1.221],
    'Population': [200.4, 143.5, 1252, 1357, 52.98]
}

In [27]:
type(data)

dict

In [28]:
import pandas as pd

In [29]:
brics = pd.DataFrame(data)

In [30]:
brics

,Country,Capital,Area,Population
0,Brazil,Brasilia,8.516,200.4
1,Russia,Moscow,17.01,143.5
2,India,New Delhi,3.286,1252.0
3,China,Beijing,9.597,1357.0
4,South Africa,Pretoria,1.221,52.98


#### Setting index in the dataset

In [31]:
brics.index = ['BR', 'RU', 'IN', 'CH', 'SA']

In [32]:
brics

,Country,Capital,Area,Population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.01,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
SA,South Africa,Pretoria,1.221,52.98


#### Resetting the index back to the original

In [None]:
brics.reset_index()

#### Setting another column as Index

In [33]:
brics.set_index('Country')

Country,Capital,Area,Population
Brazil,Brasilia,8.516,200.4
Russia,Moscow,17.01,143.5
India,New Delhi,3.286,1252.0
China,Beijing,9.597,1357.0
South Africa,Pretoria,1.221,52.98
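Note that `set_index()` and `reset_index()` return a new DataFrame and leave the original untouched unless the result is assigned back. The sketch below illustrates this on a reduced copy of the BRICS data (only two columns, to keep it short).

```python
import pandas as pd

brics = pd.DataFrame({
    'Country': ['Brazil', 'Russia', 'India', 'China', 'South Africa'],
    'Capital': ['Brasilia', 'Moscow', 'New Delhi', 'Beijing', 'Pretoria'],
})

# set_index() returns a new DataFrame; brics itself is unchanged
by_country = brics.set_index('Country')

# reset_index() moves the index back into an ordinary column
restored = by_country.reset_index()

print(brics.columns.tolist())        # still contains 'Country'
print(by_country.columns.tolist())   # 'Country' is now the index
```

To keep the change, assign the result, for example `brics = brics.set_index('Country')`.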


### Loading a dataset as a DataFrame

In [36]:
cars = pd.read_csv('cars.csv')

In [35]:
cars

,Unnamed: 0,cars_per_cap,country,drives_right
0,US,809,United States,True
1,AUS,731,Australia,False
2,JAP,588,Japan,False
3,IN,18,India,False
4,RU,200,Russia,True
5,MOR,70,Morocco,True
6,EG,45,Egypt,True


In [38]:
cars = pd.read_csv('cars.csv', index_col=0)

In [39]:
cars

,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True
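The effect of `index_col=0` can be reproduced without the file. The sketch below reads from an in-memory string instead of `cars.csv`; the layout of that string (an empty first header cell followed by the three column names) is an assumption about how the file looks, and only three rows are included.

```python
import io
import pandas as pd

# stand-in for cars.csv: the first header cell is empty (assumed file layout)
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
"""

# without index_col the label column is read as data and named 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row labels instead
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(labelled.index.tolist())
```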


In [40]:
cars = cars.reset_index()

In [41]:
cars

,index,cars_per_cap,country,drives_right
0,US,809,United States,True
1,AUS,731,Australia,False
2,JAP,588,Japan,False
3,IN,18,India,False
4,RU,200,Russia,True
5,MOR,70,Morocco,True
6,EG,45,Egypt,True


#### Indexing and Selecting data from DataFrame
+ square brackets
+ loc and iloc indexers

In [42]:
brics.columns

Index(['Country', 'Capital', 'Area', 'Population'], dtype='object')

In [43]:
brics.columns = ['A', 'B', 'C', 'D']

In [44]:
brics

,A,B,C,D
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.01,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
SA,South Africa,Pretoria,1.221,52.98


In [45]:
brics.columns = ['Country', 'Capital', 'Area', 'Population']

In [46]:
brics

,Country,Capital,Area,Population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.01,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
SA,South Africa,Pretoria,1.221,52.98


Suppose you want to extract only the <b>country</b> variable from the dataset. Use square brackets with the column name on the DataFrame:

In [47]:
cars['country']

0    United States
1        Australia
2            Japan
3            India
4           Russia
5          Morocco
6            Egypt
Name: country, dtype: object

In [48]:
type(cars['country'])  # a Series can be thought of as a labelled 1-D NumPy array

pandas.core.series.Series

In [49]:
cars[['country']]

,country
0,United States
1,Australia
2,Japan
3,India
4,Russia
5,Morocco
6,Egypt


In [50]:
type(cars[['country']])

pandas.core.frame.DataFrame

In [51]:
cars[['cars_per_cap', 'country']]

,cars_per_cap,country
0,809,United States
1,731,Australia
2,588,Japan
3,18,India
4,200,Russia
5,70,Morocco
6,45,Egypt
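The difference between single and double square brackets can be checked directly. This is a short sketch on a three-row subset of the cars data.

```python
import pandas as pd

cars = pd.DataFrame({
    'cars_per_cap': [809, 731, 588],
    'country': ['United States', 'Australia', 'Japan'],
})

# a single label returns a one-dimensional Series
one_col_series = cars['country']

# a list with one label returns a two-dimensional DataFrame
one_col_frame = cars[['country']]

# a list of labels returns the columns in the order listed
two_cols = cars[['country', 'cars_per_cap']]

print(one_col_series.shape)   # one dimension
print(one_col_frame.shape)    # rows and columns
```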


#### Selecting rows (observations) from the dataset

In [52]:
cars

,index,cars_per_cap,country,drives_right
0,US,809,United States,True
1,AUS,731,Australia,False
2,JAP,588,Japan,False
3,IN,18,India,False
4,RU,200,Russia,True
5,MOR,70,Morocco,True
6,EG,45,Egypt,True


In [53]:
cars[1:4]

,index,cars_per_cap,country,drives_right
1,AUS,731,Australia,False
2,JAP,588,Japan,False
3,IN,18,India,False
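Row selection with square brackets only works with a slice. A single integer is interpreted as a column label instead, as the following sketch on a five-row subset of the data suggests.

```python
import pandas as pd

cars = pd.DataFrame({
    'cars_per_cap': [809, 731, 588, 18, 200],
    'country': ['United States', 'Australia', 'Japan', 'India', 'Russia'],
})

# a slice inside square brackets selects rows by position; the stop is excluded
middle = cars[1:4]

# a single integer is looked up as a column label, so it fails here
try:
    cars[1]
    single_int_failed = False
except KeyError:
    single_int_failed = True

print(middle)
print(single_int_failed)
```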


#### Other Advanced Indexing techniques
+ loc (label based)
+ iloc (integer position based)

This is similar to subsetting 2-D NumPy arrays.

In [54]:
cars = cars.set_index('index')
cars

index,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [55]:
cars.loc['AUS']

cars_per_cap          731
country         Australia
drives_right        False
Name: AUS, dtype: object

In [56]:
cars.loc[['AUS']]

index,cars_per_cap,country,drives_right
AUS,731,Australia,False


In [57]:
cars.loc[['AUS', 'RU', 'US']]  # several row labels must be passed as a list (double square brackets)

index,cars_per_cap,country,drives_right
AUS,731,Australia,False
RU,200,Russia,True
US,809,United States,True


In [58]:
cars.loc[['AUS', 'RU', 'US'], ['country', 'cars_per_cap']]

index,country,cars_per_cap
AUS,Australia,731
RU,Russia,200
US,United States,809


In [59]:
cars.loc[:, ['country', 'cars_per_cap']]

index,country,cars_per_cap
US,United States,809
AUS,Australia,731
JAP,Japan,588
IN,India,18
RU,Russia,200
MOR,Morocco,70
EG,Egypt,45


#### Advantages and Disadvantages of Different Data Manipulation Techniques

Square brackets give
+ column access e.g. cars[['country', 'cars_per_cap']]
+ row access only through slicing e.g. cars[1:4]

loc (label based) gives
+ row access e.g. cars.loc[['AUS', 'RU', 'US']]
+ column access e.g. cars.loc[:, ['country', 'cars_per_cap']]
+ both row and column access e.g. cars.loc[['AUS', 'RU', 'US'], ['country', 'cars_per_cap']]
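The three kinds of `loc` access listed above can be put side by side. The sketch below uses a four-row subset of the cars data and also shows two details not covered so far: selecting a single cell, and the fact that a label slice includes its end label.

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18],
     'country': ['United States', 'Australia', 'Japan', 'India'],
     'drives_right': [True, False, False, False]},
    index=['US', 'AUS', 'JAP', 'IN'])

rows = cars.loc[['AUS', 'US']]                     # rows only
cols = cars.loc[:, ['country', 'cars_per_cap']]    # columns only
both = cars.loc[['AUS', 'US'], ['country']]        # rows and columns
cell = cars.loc['JAP', 'cars_per_cap']             # one scalar value

# unlike positional slices, a label slice includes its end label
label_slice = cars.loc['AUS':'IN']

print(label_slice.index.tolist())
```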

We can also subset a Pandas DataFrame by the integer position of observations instead of their labels, using the <b>iloc</b> indexer.

In [60]:
cars

index,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [61]:
cars.iloc[[1, 4, 0]]

index,cars_per_cap,country,drives_right
AUS,731,Australia,False
RU,200,Russia,True
US,809,United States,True


In [62]:
cars.iloc[[1, 4, 0], [1, 0]]

index,country,cars_per_cap
AUS,Australia,731
RU,Russia,200
US,United States,809


In [63]:
cars

index,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [64]:
cars["country"].value_counts()

Russia           1
Japan            1
India            1
Australia        1
United States    1
Morocco          1
Egypt            1
Name: country, dtype: int64

In [65]:
cars.iloc[:, [1, 0]]

index,country,cars_per_cap
US,United States,809
AUS,Australia,731
JAP,Japan,588
IN,India,18
RU,Russia,200
MOR,Morocco,70
EG,Egypt,45
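Every `iloc` selection has a `loc` counterpart that picks the same rows and columns by label. As a minimal sketch on a five-row subset of the data, the cell below checks that the two give the same result and shows how a label can be translated to its position.

```python
import pandas as pd

cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200],
     'country': ['United States', 'Australia', 'Japan', 'India', 'Russia']},
    index=['US', 'AUS', 'JAP', 'IN', 'RU'])

# the same selection, once by integer position and once by label
by_position = cars.iloc[[1, 4, 0], [1, 0]]
by_label = cars.loc[['AUS', 'RU', 'US'], ['country', 'cars_per_cap']]

# get_loc() translates a label to its integer position
pos_of_ru = cars.index.get_loc('RU')

# iloc slices exclude the stop position, like Python list slices
first_two = cars.iloc[0:2]

print(by_position.equals(by_label))
```

Label-based selection tends to survive reordering of rows, while position-based selection is convenient when the labels are unknown or not unique.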
