# Pandas DataFrames
In data science, the most important complex data structure is the **DataFrame**.
A DataFrame is a two-dimensional, labeled collection of tabular data -- you might think of it as a *table* or a *dataset*, depending on your background.

In [1]:
# Import pandas
import pandas as pd

In [2]:
country_dict =  {"Brazil":"BR", "Russia":"RU", "India":"IN", "China":"CH", "South Africa":"SA"}
data_dict = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
             "capital": ["Brasilia", "Moscow", "New Dehli", "Beijing", "Pretoria"],
             "area": [8.516, 17.10, 3.286, 9.597, 1.221],
             "population": [200.4, 143.5, 1252, 1357, 52.98] }

# Create a DataFrame from a Python dictionary
data_df = pd.DataFrame(data_dict)

In [3]:
data_df

Unnamed: 0,country,capital,area,population
0,Brazil,Brasilia,8.516,200.4
1,Russia,Moscow,17.1,143.5
2,India,New Dehli,3.286,1252.0
3,China,Beijing,9.597,1357.0
4,South Africa,Pretoria,1.221,52.98
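
A quick way to confirm what a DataFrame holds is to inspect its shape, column labels, and column types. The sketch below rebuilds a two-row version of the table above; the name `small_df` is only illustrative and the values are copied from that table.

```python
import pandas as pd

# A two-row stand-in for data_df (values copied from the table above)
small_df = pd.DataFrame({
    "country": ["Brazil", "Russia"],
    "area": [8.516, 17.10],
    "population": [200.4, 143.5],
})

n_rows, n_cols = small_df.shape        # (number of rows, number of columns)
column_names = list(small_df.columns)  # column labels in order

print(n_rows, n_cols)
print(column_names)
print(small_df.dtypes)                 # one dtype per column
```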


# Importing Tabular Data with Pandas

pandas is preferred because it imports the data directly into a DataFrame -- the data structure of choice for tabular data in Python.

In [4]:
# use read_csv to import flat file from the url
# https://raw.githubusercontent.com/pp-ct/scg_python/main/data/planes.csv

planes = pd.read_csv('https://raw.githubusercontent.com/pp-ct/scg_python/main/data/planes.csv')

In [5]:
planes

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
1,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
3,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
...,...,...,...,...,...,...,...,...,...
3317,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3318,N997DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS AIRCRAFT CO,MD-88,2,142,,Turbo-fan
3319,N998AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3320,N998DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS CORPORATION,MD-88,2,142,,Turbo-jet
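
`read_csv` accepts a file path, a URL, or any file-like object. As a network-free sketch, the snippet below parses a short CSV string held in memory; `csv_text`, `sample`, and `sample_indexed` are illustrative names, and the rows are a trimmed copy of the output above. Passing `index_col` asks pandas to use an existing column as the row index instead of generating a new one.

```python
import io
import pandas as pd

# A tiny CSV held in memory (a few columns copied from the planes output)
csv_text = """tailnum,year,manufacturer,seats
N10156,2004,EMBRAER,55
N102UW,1998,AIRBUS INDUSTRIE,182
"""

# Wrap the string in a file-like object so read_csv can parse it
sample = pd.read_csv(io.StringIO(csv_text))

# Same data, but use the tailnum column as the row index
sample_indexed = pd.read_csv(io.StringIO(csv_text), index_col="tailnum")

print(sample)
print(sample_indexed.loc["N10156", "seats"])
```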


In [6]:
# Show the help for a function with ?
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]],
    sep=',',
    delimiter=None,
    header='infer',
    names=None,
    index_col=None,
    usecols=None,
    squeeze=False,
    prefix=None,
    ...

In [7]:
# show first 5 rows
planes.head(5)

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
1,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
3,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan


In [8]:
# show last 5 rows
planes.tail(5)

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
3317,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3318,N997DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS AIRCRAFT CO,MD-88,2,142,,Turbo-fan
3319,N998AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3320,N998DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS CORPORATION,MD-88,2,142,,Turbo-jet
3321,N999DN,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS CORPORATION,MD-88,2,142,,Turbo-jet


# Selecting and Filtering

## Subsetting Dimensions

* We don't always want all of the data in a DataFrame, so we need to take subsets of the DataFrame.
* In general, **subsetting** is extracting a small portion of a DataFrame -- making the DataFrame smaller.
* Since the DataFrame is two-dimensional, there are two dimensions on which to subset.

**Dimension 1:** We may only want to consider certain *variables*.

For example, we may only care about the `year` and `engines` variables:

We call this **selecting** columns/variables -- this is similar to SQL's `SELECT` or R's dplyr package's `select()`.

In [9]:
planes.sample(5)

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
1821,N602LR,2008.0,Fixed wing multi engine,BOMBARDIER INC,CL-600-2D24,2,95,,Turbo-fan
176,N14219,1998.0,Fixed wing multi engine,BOEING,737-824,2,149,,Turbo-fan
1588,N544UW,2011.0,Fixed wing multi engine,AIRBUS,A321-231,2,379,,Turbo-fan
83,N12564,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
2856,N8886A,2003.0,Fixed wing multi engine,BOMBARDIER INC,CL-600-2B19,2,55,,Turbo-fan


In [10]:
# Select only year and engines columns
planes[['year', 'engines']]

Unnamed: 0,year,engines
0,2004.0,2
1,1998.0,2
2,1999.0,2
3,1999.0,2
4,2002.0,2
...,...,...
3317,2002.0,2
3318,1992.0,2
3319,2002.0,2
3320,1992.0,2
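
The double brackets matter here: a single column name returns a Series, while a list of names returns a DataFrame. A minimal sketch on made-up data (the `demo` frame is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"year": [2004, 1998], "engines": [2, 2], "seats": [55, 182]})

one_column = demo["year"]              # single name -> Series
sub_frame = demo[["year", "engines"]]  # list of names -> DataFrame

print(type(one_column).__name__)
print(type(sub_frame).__name__)
```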


**Dimension 2:** We may only want to consider certain *cases*.

For example, we may only care about the cases where the manufacturer is Embraer.

We call this **filtering** or **slicing** -- this is similar to SQL's `WHERE` or R's dplyr package's `filter()` or `slice()`.

In [11]:
# Filter by manufacturer (BOEING is shown here; use 'EMBRAER' for the Embraer example)
planes[planes['manufacturer'] == 'BOEING']

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
34,N11206,2000.0,Fixed wing multi engine,BOEING,737-824,2,149,,Turbo-fan
49,N1200K,1998.0,Fixed wing multi engine,BOEING,767-332,2,330,,Turbo-fan
50,N1201P,1998.0,Fixed wing multi engine,BOEING,767-332,2,330,,Turbo-fan
51,N12109,1994.0,Fixed wing multi engine,BOEING,757-224,2,178,,Turbo-jet
52,N12114,1995.0,Fixed wing multi engine,BOEING,757-224,2,178,,Turbo-jet
...,...,...,...,...,...,...,...,...,...
3311,N994AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3313,N995AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3315,N996AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3317,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
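
Filtering works in two steps: the comparison produces a Boolean Series (often called a mask), and indexing with that mask keeps only the rows where it is `True`. The sketch below separates the two steps on a small made-up frame; `demo`, `mask`, and `boeing_only` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({
    "tailnum": ["N10156", "N11206", "N102UW"],
    "manufacturer": ["EMBRAER", "BOEING", "AIRBUS INDUSTRIE"],
})

mask = demo["manufacturer"] == "BOEING"  # Boolean Series, one value per row
boeing_only = demo[mask]                 # keeps the rows where mask is True

print(mask.tolist())
print(boeing_only)
```

Note that the surviving rows keep their original index labels, which is why the filtered output above starts at 34 rather than 0.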


We can also combine these two options to subset in both dimensions: for example, keeping only the `year` and `engines` variables for the cases where the manufacturer is Embraer.

## Subsetting and Filtering into a New DataFrame

In the previous example, we wanted to do two things using `planes`:

  1. **select** the `year` and `engines` variables
  2. **filter** to cases where the manufacturer is Embraer

But we also want to return a new DataFrame -- not just highlight certain cells. Therefore:

  3. **return** a DataFrame to continue the analysis

In [12]:
# filter EMBRAER with 2 engines and year 2004
planes[(planes['manufacturer'] == 'EMBRAER') & (planes['engines'] == 2) & (planes['year'] == 2004)]

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
20,N11155,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
21,N11164,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
22,N11165,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
23,N11176,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
61,N12157,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
62,N12160,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
63,N12163,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
64,N12166,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
65,N12167,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan


In [13]:
# Normalize manufacturer names to upper case so string comparisons are consistent
planes['manufacturer'] = planes['manufacturer'].str.upper()

In [14]:
# filter EMBRAER or BOEING
planes[(planes['manufacturer'] == 'EMBRAER') | (planes['manufacturer'] == 'BOEING')]

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
10,N11106,2002.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
11,N11107,2002.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
12,N11109,2002.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
...,...,...,...,...,...,...,...,...,...
3311,N994AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3313,N995AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3315,N996AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3317,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
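
Each condition is wrapped in parentheses because `&` and `|` bind more tightly than `==` in Python. When one column is tested against several values, `isin` is a common alternative to chaining `|`. The sketch below, on a hypothetical `demo` frame, checks that both spellings select the same rows.

```python
import pandas as pd

demo = pd.DataFrame({
    "manufacturer": ["EMBRAER", "BOEING", "AIRBUS", "BOEING"],
    "engines": [2, 2, 2, 4],
})

# Two equality tests joined with | (OR); parentheses are required
with_or = demo[(demo["manufacturer"] == "EMBRAER") | (demo["manufacturer"] == "BOEING")]

# The same selection written with isin
with_isin = demo[demo["manufacturer"].isin(["EMBRAER", "BOEING"])]

print(with_isin)
print(with_or.equals(with_isin))
```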


We can slice cases/rows using the values in the Index and bracket subsetting notation. It's common practice to use `.loc` to slice cases/rows:

In [15]:
# rows with index labels 34 through 40 (.loc includes both endpoints)
planes.loc[34:40]

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
34,N11206,2000.0,Fixed wing multi engine,BOEING,737-824,2,149,,Turbo-fan
35,N112US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
36,N113UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
37,N114UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
38,N11535,2001.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
39,N11536,2001.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
40,N11539,2001.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
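
`.loc` slices by index label and includes the stop label, which is why the output above has seven rows (34 through 40). Positional slicing with `.iloc` follows the usual Python convention and excludes the stop position. A small sketch on a hypothetical frame with a non-zero-based index:

```python
import pandas as pd

demo = pd.DataFrame({"seats": [55, 182, 149, 330]}, index=[10, 11, 12, 13])

by_label = demo.loc[10:12]    # labels 10, 11, 12 (stop label included)
by_position = demo.iloc[0:2]  # positions 0 and 1 (stop position excluded)

print(len(by_label), len(by_position))
```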


We can also pass a `list` of Index values:

In [16]:
# index 0, 2, 4, 8
planes.loc[[0, 2, 4, 8]]

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
8,N109UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan


We can also pass a Boolean condition to `.loc`:

In [17]:
planes.loc[planes['year'] == 2009]

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
86,N125UW,2009.0,Fixed wing multi engine,AIRBUS,A320-214,2,182,,Turbo-fan
87,N126UW,2009.0,Fixed wing multi engine,AIRBUS,A320-214,2,182,,Turbo-fan
106,N131EV,2009.0,Fixed wing multi engine,BOMBARDIER INC,CL-600-2D24,2,95,,Turbo-fan
109,N132EV,2009.0,Fixed wing multi engine,BOMBARDIER INC,CL-600-2D24,2,95,,Turbo-fan
110,N133EV,2009.0,Fixed wing multi engine,BOMBARDIER INC,CL-600-2D24,2,95,,Turbo-fan
...,...,...,...,...,...,...,...,...,...
3130,N939WN,2009.0,Fixed wing multi engine,BOEING,737-7H4,2,140,,Turbo-fan
3134,N940WN,2009.0,Fixed wing multi engine,BOEING,737-7H4,2,140,,Turbo-fan
3139,N941WN,2009.0,Fixed wing multi engine,BOEING,737-7H4,2,140,,Turbo-fan
3143,N942WN,2009.0,Fixed wing multi engine,BOEING,737-7H4,2,140,,Turbo-fan


## Selecting Variables and Filtering Cases

If we want to select variables and filter cases at the same time, we have two options:

1. Sequential operations
2. Simultaneous operations

In [18]:
# Sequential: filter to EMBRAER rows first, then select columns
planes_filtered = planes[planes['manufacturer'] == 'EMBRAER']
planes_filtered_and_selected = planes_filtered[['year', 'seats', 'manufacturer']]
planes_filtered_and_selected

Unnamed: 0,year,seats,manufacturer
0,2004.0,55,EMBRAER
4,2002.0,55,EMBRAER
10,2002.0,55,EMBRAER
11,2002.0,55,EMBRAER
12,2002.0,55,EMBRAER
...,...,...,...
3224,2008.0,20,EMBRAER
3233,2008.0,20,EMBRAER
3241,2008.0,20,EMBRAER
3250,2008.0,20,EMBRAER


In [19]:
# Simultaneous: filter rows and select columns in a single .loc call
planes.loc[planes['manufacturer'] == 'EMBRAER', ['year', 'seats', 'manufacturer']]

Unnamed: 0,year,seats,manufacturer
0,2004.0,55,EMBRAER
4,2002.0,55,EMBRAER
10,2002.0,55,EMBRAER
11,2002.0,55,EMBRAER
12,2002.0,55,EMBRAER
...,...,...,...
3224,2008.0,20,EMBRAER
3233,2008.0,20,EMBRAER
3241,2008.0,20,EMBRAER
3250,2008.0,20,EMBRAER
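
Both approaches return the same rows and columns, as the two outputs above suggest. The check below reproduces that on a small made-up frame (`demo` is hypothetical).

```python
import pandas as pd

demo = pd.DataFrame({
    "manufacturer": ["EMBRAER", "BOEING", "EMBRAER"],
    "year": [2004.0, 1998.0, 2002.0],
    "seats": [55, 149, 55],
})

# 1. Sequential: filter rows, then select columns
sequential = demo[demo["manufacturer"] == "EMBRAER"][["year", "seats"]]

# 2. Simultaneous: one .loc call with a row condition and a column list
simultaneous = demo.loc[demo["manufacturer"] == "EMBRAER", ["year", "seats"]]

print(sequential.equals(simultaneous))
```

The single `.loc` call is usually the safer choice when the result will be modified afterwards, because it avoids chained indexing.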


# Creating Columns and Manipulating

> During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time.
>
> \- Wes McKinney, the creator of Pandas, in his book *Python for Data Analysis*

## Creating New Columns

It's common to want to modify a column of a DataFrame, or sometimes even to create a new column.
Let's take a look at our planes data again.

In [20]:
planes

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
1,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
3,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
...,...,...,...,...,...,...,...,...,...
3317,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3318,N997DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS AIRCRAFT CO,MD-88,2,142,,Turbo-fan
3319,N998AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3320,N998DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS CORPORATION,MD-88,2,142,,Turbo-jet


In [21]:
# For simplicity, let's say a full flight crew is always 5 people.
planes['capacity'] = planes['seats'] + 5

In [22]:
planes

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine,capacity
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan,60
1,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187
3,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan,60
...,...,...,...,...,...,...,...,...,...,...
3317,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan,105
3318,N997DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS AIRCRAFT CO,MD-88,2,142,,Turbo-fan,147
3319,N998AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan,105
3320,N998DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS CORPORATION,MD-88,2,142,,Turbo-jet,147


In [23]:
# Arithmetic between two columns is applied row by row
planes['seats_per_engine'] = planes['seats'] / planes['engines']
planes

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine,capacity,seats_per_engine
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan,60,27.5
1,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0
3,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan,60,27.5
...,...,...,...,...,...,...,...,...,...,...,...
3317,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan,105,50.0
3318,N997DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS AIRCRAFT CO,MD-88,2,142,,Turbo-fan,147,71.0
3319,N998AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan,105,50.0
3320,N998DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS CORPORATION,MD-88,2,142,,Turbo-jet,147,71.0


In [24]:
# String columns can be concatenated with +
planes['summary'] = planes['manufacturer'] + ' | ' + planes['engine']
planes.head()

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine,capacity,seats_per_engine,summary
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan,60,27.5,EMBRAER | Turbo-fan
1,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0,AIRBUS INDUSTRIE | Turbo-fan
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0,AIRBUS INDUSTRIE | Turbo-fan
3,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0,AIRBUS INDUSTRIE | Turbo-fan
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan,60,27.5,EMBRAER | Turbo-fan


In [25]:
# The .str accessor applies string methods to every value in a column
planes['lower_manufacturer'] = planes['manufacturer'].str.lower()
planes.head()

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine,capacity,seats_per_engine,summary,lower_manufacturer
0,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan,60,27.5,EMBRAER | Turbo-fan,embraer
1,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0,AIRBUS INDUSTRIE | Turbo-fan,airbus industrie
2,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0,AIRBUS INDUSTRIE | Turbo-fan,airbus industrie
3,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,187,91.0,AIRBUS INDUSTRIE | Turbo-fan,airbus industrie
4,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan,60,27.5,EMBRAER | Turbo-fan,embraer
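
The assignments above add columns to `planes` in place. If the original DataFrame should stay untouched, `DataFrame.assign` returns a new DataFrame with the extra columns instead. A minimal sketch on a hypothetical two-row frame, reusing the same crew-of-5 assumption:

```python
import pandas as pd

demo = pd.DataFrame({"seats": [55, 182], "engines": [2, 2]})

# assign returns a new DataFrame; demo itself is not modified
with_new_cols = demo.assign(
    capacity=demo["seats"] + 5,
    seats_per_engine=demo["seats"] / demo["engines"],
)

print(with_new_cols)
print(list(demo.columns))
```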


## Mapping Values

In [26]:
data_df

Unnamed: 0,country,capital,area,population
0,Brazil,Brasilia,8.516,200.4
1,Russia,Moscow,17.1,143.5
2,India,New Dehli,3.286,1252.0
3,China,Beijing,9.597,1357.0
4,South Africa,Pretoria,1.221,52.98


In [27]:
country_dict

{'Brazil': 'BR',
 'Russia': 'RU',
 'India': 'IN',
 'China': 'CH',
 'South Africa': 'SA'}

In [28]:
data_df['short_name'] = data_df['country'].map(country_dict)
data_df

Unnamed: 0,country,capital,area,population,short_name
0,Brazil,Brasilia,8.516,200.4,BR
1,Russia,Moscow,17.1,143.5,RU
2,India,New Dehli,3.286,1252.0,IN
3,China,Beijing,9.597,1357.0,CH
4,South Africa,Pretoria,1.221,52.98,SA
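
One thing to watch with `map` and a dictionary: any value without a matching key becomes `NaN` rather than raising an error. The sketch below uses a deliberately incomplete, hypothetical dictionary (`codes`) to show this, then fills the gap with a placeholder.

```python
import pandas as pd

codes = {"Brazil": "BR", "India": "IN"}  # deliberately missing "Russia"
countries = pd.Series(["Brazil", "Russia", "India"])

mapped = countries.map(codes)      # "Russia" has no key, so it maps to NaN
filled = mapped.fillna("unknown")  # replace missing results with a placeholder

print(mapped.tolist())
print(filled.tolist())
```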


# Summarizing Data

In [29]:
# Describe function
planes.describe()

Unnamed: 0,year,engines,seats,speed,capacity,seats_per_engine
count,3252.0,3322.0,3322.0,23.0,3322.0,3322.0
mean,2000.48401,1.995184,154.316376,236.782609,159.316376,77.064996
std,7.193425,0.117593,73.654974,149.759794,73.654974,36.589602
min,1956.0,1.0,2.0,90.0,7.0,0.5
25%,1997.0,2.0,140.0,107.5,145.0,70.0
50%,2001.0,2.0,149.0,162.0,154.0,74.5
75%,2005.0,2.0,182.0,432.0,187.0,91.0
max,2013.0,4.0,450.0,432.0,455.0,200.0


In [30]:
# Unique values
planes['manufacturer'].unique()

array(['EMBRAER', 'AIRBUS INDUSTRIE', 'BOEING', 'AIRBUS',
       'BOMBARDIER INC', 'CESSNA', 'JOHN G HESS', 'GULFSTREAM AEROSPACE',
       'SIKORSKY', 'PIPER', 'AGUSTA SPA', 'PAIR MIKE E', 'DOUGLAS',
       'BEECH', 'BELL', 'AVIAT AIRCRAFT INC', 'STEWART MACO',
       'LEARJET INC', 'MCDONNELL DOUGLAS', 'CIRRUS DESIGN CORP',
       'HURLEY JAMES LARRY', 'KILDALL GARY', 'LAMBERT RICHARD',
       'BARKER JACK L', 'AMERICAN AIRCRAFT INC', 'ROBINSON HELICOPTER CO',
       'FRIEDEMANN JON', 'LEBLANC GLENN T', 'MARZ BARRY', 'DEHAVILLAND',
       'CANADAIR', 'CANADAIR LTD', 'MCDONNELL DOUGLAS CORPORATION',
       'MCDONNELL DOUGLAS AIRCRAFT CO', 'AVIONS MARCEL DASSAULT'],
      dtype=object)

In [31]:
# Relative frequency of each manufacturer (normalize=True returns proportions)
planes['manufacturer'].value_counts(normalize=True)

BOEING                           0.490668
AIRBUS INDUSTRIE                 0.120409
BOMBARDIER INC                   0.110777
AIRBUS                           0.101144
EMBRAER                          0.090006
MCDONNELL DOUGLAS                0.036123
MCDONNELL DOUGLAS AIRCRAFT CO    0.031005
MCDONNELL DOUGLAS CORPORATION    0.004214
CANADAIR                         0.002709
CESSNA                           0.002709
PIPER                            0.001505
BEECH                            0.000602
BELL                             0.000602
AMERICAN AIRCRAFT INC            0.000602
GULFSTREAM AEROSPACE             0.000602
STEWART MACO                     0.000602
CANADAIR LTD                     0.000301
AVIONS MARCEL DASSAULT           0.000301
FRIEDEMANN JON                   0.000301
CIRRUS DESIGN CORP               0.000301
MARZ BARRY                       0.000301
ROBINSON HELICOPTER CO           0.000301
PAIR MIKE E                      0.000301
HURLEY JAMES LARRY               0

In [32]:
# Count missing (null) values in the year column
planes['year'].isna().sum()

70
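
The same idea extends to a whole DataFrame: `isna()` returns a Boolean frame, so `sum()` counts missing values per column and `mean()` gives the missing share. A sketch on made-up data (`demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "year": [2004.0, np.nan, 1999.0],
    "speed": [np.nan, np.nan, 90.0],
    "seats": [55, 182, 182],
})

missing_per_column = demo.isna().sum()  # number of NaN values in each column
missing_share = demo.isna().mean()      # fraction of rows missing in each column

print(missing_per_column)
print(missing_share)
```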

## Summary Methods

In [33]:
flights = pd.read_csv('https://raw.githubusercontent.com/pp-ct/scg_python/main/data/flights.csv')

In [34]:
flights.sample(5)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
228887,2013,6,8,828.0,829,-1.0,1004.0,1034,-30.0,MQ,4478,N722MQ,LGA,DTW,76.0,502,8,29,2013-06-08 08:00:00
309934,2013,9,2,550.0,600,-10.0,813.0,834,-21.0,B6,27,N712JB,EWR,MCO,121.0,937,6,0,2013-09-02 06:00:00
211661,2013,5,20,1647.0,1648,-1.0,1909.0,1918,-9.0,UA,562,N464UA,EWR,DEN,209.0,1605,16,48,2013-05-20 16:00:00
68293,2013,11,14,1156.0,1158,-2.0,1455.0,1506,-11.0,B6,1129,N510JB,JFK,RSW,160.0,1074,11,58,2013-11-14 11:00:00
115206,2013,2,5,1555.0,1600,-5.0,1859.0,1909,-10.0,B6,157,N586JB,JFK,MCO,146.0,944,16,0,2013-02-05 16:00:00


In [35]:
# sum
flights['distance'].sum()

350217607

In [36]:
# mean
flights['distance'].mean()

1039.9126036297123

In [37]:
# median
flights['distance'].median()

872.0

In [38]:
# mode
flights['distance'].mode()

0    2475
dtype: int64

In [39]:
# value counts (absolute frequencies; pass normalize=True for proportions)
flights['distance'].value_counts()

2475    11262
762     10263
733      8857
2586     8204
544      6168
        ...  
17          1
892         1
964         1
865         1
604         1
Name: distance, Length: 214, dtype: int64
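
By default `value_counts()` returns absolute counts sorted from most to least frequent; with `normalize=True` it returns proportions that sum to 1. A short sketch on a hypothetical Series:

```python
import pandas as pd

distances = pd.Series([2475, 762, 2475, 733, 2475, 762])

counts = distances.value_counts()                # absolute frequencies, most common first
shares = distances.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares)
```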

In [40]:
# describe
print(flights['distance'].dtype)
flights['distance'].describe()

int64


count    336776.000000
mean       1039.912604
std         733.233033
min          17.000000
25%         502.000000
50%         872.000000
75%        1389.000000
max        4983.000000
Name: distance, dtype: float64

In [41]:
# describe
print(flights['carrier'].dtype)
flights['carrier'].describe()

object


count     336776
unique        16
top           UA
freq       58665
Name: carrier, dtype: object

## The Aggregation Method

In [42]:
flights.agg({
    'sched_dep_time': ['mean'],
    'dep_time': ['mean']
})

Unnamed: 0,sched_dep_time,dep_time
mean,1344.25484,1349.109947


In [43]:
# Your turn 1) distance: min, max, mean | 2) air_time: mean, min

flights.agg({'distance': ['min', 'max', 'mean'], 'air_time': ['mean', 'min']})

Unnamed: 0,distance,air_time
max,4983.0,
mean,1039.912604,150.68646
min,17.0,20.0


In [44]:
flights.describe(include = ['int', 'float', 'object'])

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
count,336776.0,336776.0,336776.0,328521.0,336776.0,328521.0,328063.0,336776.0,327346.0,336776,336776.0,334264,336776,336776,327346.0,336776.0,336776.0,336776.0,336776
unique,,,,,,,,,,16,,4043,3,105,,,,,6936
top,,,,,,,,,,UA,,N725MQ,EWR,ORD,,,,,2013-09-20 08:00:00
freq,,,,,,,,,,58665,,575,120835,17283,,,,,94
mean,2013.0,6.54851,15.710787,1349.109947,1344.25484,12.63907,1502.054999,1536.38022,6.895377,,1971.92362,,,,150.68646,1039.912604,13.180247,26.2301,
std,0.0,3.414457,8.768607,488.281791,467.335756,40.210061,533.264132,497.457142,44.633292,,1632.471938,,,,93.688305,733.233033,4.661316,19.300846,
min,2013.0,1.0,1.0,1.0,106.0,-43.0,1.0,1.0,-86.0,,1.0,,,,20.0,17.0,1.0,0.0,
25%,2013.0,4.0,8.0,907.0,906.0,-5.0,1104.0,1124.0,-17.0,,553.0,,,,82.0,502.0,9.0,8.0,
50%,2013.0,7.0,16.0,1401.0,1359.0,-2.0,1535.0,1556.0,-5.0,,1496.0,,,,129.0,872.0,13.0,29.0,
75%,2013.0,10.0,23.0,1744.0,1729.0,11.0,1940.0,1945.0,14.0,,3465.0,,,,192.0,1389.0,17.0,44.0,
