# Pandas DataFrames
In data science, the most important complex data structure is the **DataFrame**.
A DataFrame is a collection of tabular data -- you might think of one as a *table* or a *dataset*, depending on your background.

In [None]:
# Import pandas
import pandas as pd

In [None]:
country_dict = {"Brazil":"BR", "Russia":"RU", "India":"IN", "China":"CH", "South Africa":"SA"}
data_dict = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
             "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
             "area": [8.516, 17.10, 3.286, 9.597, 1.221],
             "population": [200.4, 143.5, 1252, 1357, 52.98] }

# create a DataFrame from a Python dictionary
data_df = pd.DataFrame(data_dict)

# Importing Tabular Data with Pandas

pandas is the preferred tool for importing tabular data because it reads a file directly into a DataFrame -- the data structure of choice for tabular data in Python.
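As a minimal sketch of what `read_csv` hands back, the snippet below parses a small CSV held in memory rather than a file on disk. The rows are made up for illustration (they are not taken from `planes.csv`), and the name `toy_planes` is hypothetical.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a flat file (not the real planes.csv).
csv_text = """tailnum,year,manufacturer,engines,seats
N00001,2004,EMBRAER,2,55
N00002,1998,BOEING,2,149
N00003,,AIRBUS,2,182
"""

# read_csv accepts a path, a URL, or any file-like object.
toy_planes = pd.read_csv(io.StringIO(csv_text))

print(type(toy_planes))   # the result is a DataFrame
print(toy_planes.shape)   # (rows, columns)
print(toy_planes.dtypes)  # one inferred dtype per column
```

The empty `year` field in the last row is read as a missing value, which is why checks such as `isna()` are useful right after an import.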

In [None]:
# use read_csv to import flat file from the url
planes = pd.read_csv('https://raw.githubusercontent.com/pp-ct/scg_python/main/data/planes.csv')


In [None]:
# Help with ?
pd.read_csv?

In [None]:
# show first 5 rows
planes.head()

In [None]:
# show last 5 rows
planes.tail()

# Selecting and Filtering

## Subsetting Dimensions

* We don't always want all of the data in a DataFrame, so we need to take subsets of the DataFrame.
* In general, **subsetting** is extracting a small portion of a DataFrame -- making the DataFrame smaller.
* Since the DataFrame is two-dimensional, there are two dimensions on which to subset.

**Dimension 1:** We may only want to consider certain *variables*.

For example, we may only care about the `year` and `engines` variables.

We call this **selecting** columns/variables -- this is similar to SQL's `SELECT` or R's dplyr package's `select()`.

In [None]:
planes.sample(5)

In [None]:
# Select only year and engines columns
planes[['year', 'engines']]


**Dimension 2:** We may only want to consider certain *cases*.

For example, we may only care about the cases where the manufacturer is Embraer.

We call this **filtering** or **slicing** -- this is similar to SQL's `WHERE` or R's dplyr package's `filter()` or `slice()`.

In [None]:
# filter only Embraer
planes[planes['manufacturer'] == 'EMBRAER']


And we can combine these two options to subset in both dimensions -- for example, the `year` and `engines` variables where the manufacturer is Embraer.

## Subsetting and Filtering into a New DataFrame

In the previous example, we want to do two things using `planes`:

  1. **select** the `year` and `engines` variables
  2. **filter** to cases where the manufacturer is Embraer

But we also want to return a new DataFrame -- not just highlight certain cells. Therefore:

  3. **return** a DataFrame to continue the analysis
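The three steps above can be sketched on a small made-up table. The rows below only mimic a few `planes` columns and are not real data; `toy_planes` and the other variable names are hypothetical. The sketch also shows one way to combine several conditions, using `&` for "and" and `isin()` for "one of these values".

```python
import pandas as pd

# Made-up rows that mimic a few planes columns (illustrative values only).
toy_planes = pd.DataFrame({
    "manufacturer": ["EMBRAER", "BOEING", "EMBRAER", "AIRBUS", "EMBRAER"],
    "year": [2004, 1998, 2004, 2001, 1999],
    "engines": [2, 4, 2, 2, 2],
})

# A Boolean Series: True for the rows we want to keep.
is_embraer = toy_planes["manufacturer"] == "EMBRAER"

# Filter rows, select columns, and keep the result as a new DataFrame.
embraer = toy_planes[is_embraer][["year", "engines"]]

# Combine conditions: each one needs its own parentheses around it.
embraer_2004 = toy_planes[is_embraer & (toy_planes["year"] == 2004)]

# Match any of several values.
embraer_or_boeing = toy_planes[toy_planes["manufacturer"].isin(["EMBRAER", "BOEING"])]

print(embraer)
print(toy_planes.shape)  # the original DataFrame is unchanged
```

Because each result is stored under its own name, the original `toy_planes` stays intact and the analysis can continue from either object.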

In [None]:
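# filter EMBRAER with 2 engines and year 2004
planes[(planes['manufacturer'] == 'EMBRAER') & (planes['engines'] == 2) & (planes['year'] == 2004)]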


In [None]:
# filter EMBRAER or BOEING
planes[(planes['manufacturer'] == 'EMBRAER') | (planes['manufacturer'] == 'BOEING')]


We can slice cases/rows using the values in the Index and bracket subsetting notation. It's common practice to use `.loc` to slice cases/rows:

In [None]:
# from 0:5
planes.loc[0:5]

We can also pass a `list` of Index values:

In [None]:
# index 0, 2, 4, 8
planes.loc[[0, 2, 4, 8]]

We can also pass a Boolean condition to `.loc` to filter cases.
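A short sketch of the three ways of using `.loc` on rows, run against a made-up table (the values and the name `toy_planes` are hypothetical, and the default integer Index is assumed). One detail worth noticing is that a `.loc` slice is label-based and includes its end label.

```python
import pandas as pd

# Made-up rows with the default integer Index 0..5 (illustrative values only).
toy_planes = pd.DataFrame({
    "manufacturer": ["EMBRAER", "BOEING", "EMBRAER", "AIRBUS", "BOEING", "EMBRAER"],
    "year": [2004, 1998, 2004, 2001, 2012, 1999],
})

# Slice of Index labels: the end label 2 is included, so this returns 3 rows.
first_rows = toy_planes.loc[0:2]

# List of Index labels.
picked_rows = toy_planes.loc[[0, 2, 4]]

# Boolean condition: keep rows where the condition is True.
after_2000 = toy_planes.loc[toy_planes["year"] > 2000]

print(len(first_rows))
print(picked_rows)
print(after_2000)
```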

## Selecting Variables and Filtering Cases

If we want to select variables and filter cases at the same time, we have a few options:

1. Sequential operations
2. Simultaneous operations
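The two options can be compared side by side on a small made-up table (the values and variable names here are hypothetical). In this sketch, "sequential" means filtering first and selecting from the intermediate result, while "simultaneous" means giving `.loc` a row condition and a column list in one call.

```python
import pandas as pd

# Made-up rows that mimic a few planes columns (illustrative values only).
toy_planes = pd.DataFrame({
    "manufacturer": ["EMBRAER", "BOEING", "EMBRAER", "AIRBUS"],
    "year": [2004, 1998, 2003, 2001],
    "engines": [2, 2, 2, 2],
})

# 1. Sequential: filter the rows, then select the columns.
toy_filtered = toy_planes[toy_planes["manufacturer"] == "EMBRAER"]
sequential = toy_filtered[["year", "engines"]]

# 2. Simultaneous: rows and columns in a single .loc call.
simultaneous = toy_planes.loc[toy_planes["manufacturer"] == "EMBRAER", ["year", "engines"]]

# Both approaches produce the same DataFrame.
print(sequential.equals(simultaneous))
```

The sequential form leaves an intermediate DataFrame that can be inspected, while the `.loc` form does the work in one step.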

In [None]:
# EMBRAER
planes_filtered = planes[planes['manufacturer'] == 'EMBRAER']
planes_filtered_and_selected = planes_filtered[['year', 'engines']]

In [None]:
planes.loc[planes['manufacturer'] == 'EMBRAER', ['year', 'engines']]

# Creating Columns and Manipulating

> During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time.
>
> \- Wes McKinney, the creator of Pandas, in his book *Python for Data Analysis*

## Creating New Columns

It's common to want to modify a column of a DataFrame, or sometimes even to create a new column.
Let's take a look at our planes data again.
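Before working on the full data, here is a minimal sketch of the common ways to build a column, using a made-up two-row table. The values are illustrative, the `seats` column is assumed to exist in the real data, and the wording of the `summary` text is only a hypothetical example of combining columns into a string.

```python
import pandas as pd

# Made-up rows (illustrative values only).
toy_planes = pd.DataFrame({
    "manufacturer": ["EMBRAER", "BOEING"],
    "seats": [55, 149],
    "engines": [2, 2],
})

# Column plus a scalar: the scalar is applied to every row.
toy_planes["capacity"] = toy_planes["seats"] + 5

# Column combined with another column: computed row by row.
toy_planes["seats_per_engine"] = toy_planes["seats"] / toy_planes["engines"]

# String columns: numbers must be converted with astype(str) before joining.
toy_planes["summary"] = (
    toy_planes["manufacturer"] + " with " + toy_planes["seats"].astype(str) + " seats"
)

# String methods are available through the .str accessor.
toy_planes["lower_manufacturer"] = toy_planes["manufacturer"].str.lower()

print(toy_planes)
```

Assigning to a column name that does not exist creates the column; assigning to an existing name overwrites it.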

In [None]:
# For simplicity, let's say a full flight crew is always 5 people.
planes['capacity'] = planes['seats'] + 5

In [None]:
planes['seats_per_engine'] = planes['seats'] / planes['engines']

In [None]:
planes['summary'] 

In [None]:
planes['lower_manufacturer'] = planes['manufacturer'].str.lower()

## Mapping Values

In [None]:
data_df

In [None]:
country_dict

In [None]:
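# look up each country in country_dict and store the result as a new column
data_df['short_name'] = data_df['country'].map(country_dict)
data_df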

# Summarizing Data

In [None]:
# Describe function
planes.describe()

In [None]:
# unique values
planes['manufacturer'].unique()

In [None]:
# value counts
planes['manufacturer'].value_counts()

In [None]:
# Check null value
planes['year'].isna().sum()

## Summary Methods

In [None]:
flights = pd.read_csv('https://raw.githubusercontent.com/pp-ct/scg_python/main/data/flights.csv')

In [None]:
flights.sample(5)

In [None]:
# sum
flights['distance'].sum()

In [None]:
# mean
flights['distance'].mean()

In [None]:
# median
flights['distance'].median()

In [None]:
# mode
flights['distance'].mode()

In [None]:
# % value count
flights['distance'].value_counts(normalize=True)

In [None]:
# describe (numeric column)
print(flights['distance'].dtype)
flights['distance'].describe()

In [None]:
# describe (text column)
print(flights['carrier'].dtype)
flights['carrier'].describe()
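The two `describe` cells above behave differently because of the column's dtype. A minimal sketch on made-up data (the values and the name `toy_flights` are hypothetical) shows the two kinds of summary: numeric statistics for a numeric column, and count/unique/top/freq for a text column.

```python
import pandas as pd

# Made-up rows that mimic two flights columns (illustrative values only).
toy_flights = pd.DataFrame({
    "carrier": ["UA", "AA", "UA", "DL"],
    "distance": [1400, 1089, 1400, 762],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = toy_flights["distance"].describe()

# Text column: count, number of unique values, most frequent value, its frequency.
text_summary = toy_flights["carrier"].describe()

print(numeric_summary)
print(text_summary)
```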

## The Aggregation Method

In [None]:
flights.agg({
    'sched_dep_time': ['mean'],
    'dep_time': ['mean']
})
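As a sketch of how `agg` handles a dictionary, the example below asks for different summaries per column on a made-up table (the values and the name `toy_flights` are hypothetical). Each key is a column name and each value is a list of summary function names.

```python
import pandas as pd

# Made-up rows that mimic two flights columns (illustrative values only).
toy_flights = pd.DataFrame({
    "distance": [1400, 1089, 1400, 762],
    "air_time": [227, 160, 230, 116],
})

# One row per summary function, one column per variable.
summary = toy_flights.agg({
    "distance": ["min", "max"],
    "air_time": ["median"],
})

print(summary)

# Combinations that were not requested are filled with NaN.
print(summary.loc["median", "distance"])
```

The result is itself a DataFrame, so individual values can be pulled out with `.loc`, for example `summary.loc["min", "distance"]`.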

In [None]:
# Your turn distance: min, max, mean | air_time: mean


In [None]:
flights.describe(include = ['int', 'float', 'object'])