# Pandas
---
Notebook by Alice Hsu (Mar 2025)

Concepts covered:
* Reading in data with Pandas
* Accessing and changing data
* Methods for exploring data
* Methods for cleaning data
* Methods for analyzing data
* Practical with Titanic Survivors dataset
---

# Introduction to [`Pandas`](https://pandas.pydata.org/docs/)

[`Pandas`](https://pandas.pydata.org/docs/) is a Python package used for working with large datasets that mix different types of data (e.g., numbers and names). It is especially useful for observational/statistical datasets, which are often laid out like Excel spreadsheets.

It is very commonly used for working with data with many named features, or labels. Pandas is typically imported using the `pd` alias:

    import pandas as pd

For example, a market research dataset may contain demographic information about a company's customers. The <span style="color:blue">**rows**</span> (or <span style="color:blue">**index**</span>) in the dataset would represent each customer, while the <span style="color:darkred">**columns**</span> would represent the different characteristics of each customer.

<img src="figures/cust_df.png" style="height:90%; width:90%;">

Pandas is very powerful because it has very streamlined functions and methods for:
* Reading in, cleaning, merging, and formatting data
* Statistical analysis
* Hierarchically labelling data

It is also compatible with many other useful packages, including Seaborn, a statistical plotting package.

Some other perks:
* Can easily handle **missing data**
* Columns of **data can be inserted and/or deleted** from loaded data sets (size mutability)
* Data can be automatically or explicitly aligned to a set of labels (data alignment)
* **GroupBy functionality** to analyze data sets by different indices or labels
* Can **convert** ragged (weirdly sized) or differently-indexed data in **other data structures (e.g., `numpy` array, Python `list`) into `pandas` objects**
* Can access, subset, or index large datasets with **label or multiple labels**
    * **Hierarchical labeling** of axes (possible to have multiple labels per tick) (MultiIndex)
* Intuitive **merging and joining** data sets
* Flexible **reshaping and pivoting** (i.e., rearranging rows into columns and vice versa) of data sets
* Many tools/packages can load data from **different file types** into `Pandas` DataFrames:
    * CSV, txt, and other delimited files --> `pd.read_csv()`
    * Excel/ODS files --> `pd.read_excel()`
    * netCDF and HDF5 format --> `xarray` package
    * See [IO tools documentation](https://pandas.pydata.org/docs/user_guide/io.html) for more information on what function to use for what data type
* Built in functionality for handling **time series**
    * `Pandas datetime`

[`Pandas`](https://pandas.pydata.org/docs/) is built on top of [`numpy`](https://numpy.org/doc/stable/index.html) and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

The community standard to import `pandas` is to use `pd` as an alias:

In [3]:
import pandas as pd

# Import other useful modules
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## Pandas DataFrames

Data in Pandas is typically handled as a Pandas DataFrame. You can create a Pandas DataFrame by calling the `pd.DataFrame()` constructor.

Pandas DataFrames are similar to NumPy arrays, except datapoints in Pandas DataFrames have names, or **labels**.

The key inputs into this function are:
* The `data`: each of your data points
* The `index`: the label you want to assign to each of your data points
* The `columns`: the name of the columns in your data

### Creating DataFrames

There are several ways to create DataFrames. The most convenient way will depend on how your data is entered into your notebook.

For example, if your data is formatted such that you have **each row** of data in a list, then you can create the DataFrame by specifying your data (`data`), your indexes (`index`), and your column names (`columns`).

Note that the list you pass as **`index` must have one entry per row** of your data, and the list you pass as **`columns` must have one entry per column** of your data.

In [11]:
my_data = [['F',23,45000,'MS','USA'],
           ['M',34,36000,'BS','England'],
           ['NB',24,72000,'BS','USA'],
           ['F',65,32000,'HS','China']]
cust_df = pd.DataFrame(data=my_data,
                       index=['cust01','cust02','cust03','cust04'],
                       columns=['Gender','Age','Income USD','Education','Country'])

cust_df

Unnamed: 0,Gender,Age,Income USD,Education,Country
cust01,F,23,45000,MS,USA
cust02,M,34,36000,BS,England
cust03,NB,24,72000,BS,USA
cust04,F,65,32000,HS,China
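
As a quick illustration of the length requirement on `index` and `columns`, here is a minimal sketch using a small made-up dataset. The names `rows`, `ok_df`, and `index_mismatch_caught` are hypothetical and not part of the customer data above. It assumes that Pandas raises a `ValueError` when the number of index labels does not match the number of rows.

```python
import pandas as pd

# Hypothetical mini-dataset: three rows, two columns
rows = [['F', 23], ['M', 34], ['NB', 24]]

# Three rows of data but only two index labels: Pandas refuses to build the DataFrame
try:
    pd.DataFrame(data=rows, index=['cust01', 'cust02'], columns=['Gender', 'Age'])
    index_mismatch_caught = False
except ValueError as err:
    index_mismatch_caught = True
    print('Could not create DataFrame:', err)

# One index label per row and one column name per column works as expected
ok_df = pd.DataFrame(data=rows,
                     index=['cust01', 'cust02', 'cust03'],
                     columns=['Gender', 'Age'])
print(ok_df)
```

If you see an error like this when building your own DataFrame, counting the entries in `index` and `columns` is usually a good first check.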


**What is the shape of `cust_df`? Check here using the `.shape` property.**

In [None]:
#### YOUR CODE HERE ####


You can also create Pandas DataFrames from **Python dictionaries** (created using the curly braces, {}).

**Python dictionaries** are conducive to creating DataFrames because Pandas can recognize the **dictionary keys as column names** and the **dictionary values as values**.

In [5]:
# My data
temps = [31.5, 32.5, 30.2, 'err']
pressures = [100.0, 'err', 180.0, 190.0]
exps = ['Exp1','Exp2','Exp3','Exp4']

# Create a DataFrame with my data
d = {'Temp':temps, 'Pres':pressures}
exp_df = pd.DataFrame(data=d,index=exps)
print(exp_df)

      Temp   Pres
Exp1  31.5  100.0
Exp2  32.5    err
Exp3  30.2  180.0
Exp4   err  190.0


Here, we have 2 columns of data that correspond with different experiments.
* What does the **index** represent in the DataFrame `exp_df`? 
* What are the **names of the 2 columns of data**? What does each of these columns represent?
* Note the shape of each of the variables in my data (`temps`, `pressures`, `exps`). **What must be true about each of their shapes in order to make a DataFrame?**

When making a DataFrame, it is generally useful to ask yourself how the DataFrame will be structured - i.e., what will the **index** and the **columns** represent?
***
**NOTE**: More often than not, you will probably **load in an existing dataset** rather than creating the dataset within your Python code, so don't worry too much about the nuances of creating DataFrames.

Pandas has a range of built-in functions for reading in different file types, including Excel (.xls or .xlsx), text (.txt), and CSV (.csv). In our examples today, we will use the Pandas `pd.read_csv()` function to read in comma-separated value (CSV) files.
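
To get a feel for `pd.read_csv()` before working with a real file, the sketch below reads CSV text held in memory. Here `io.StringIO` simply stands in for a file on disk, and `csv_text`, `demo_df`, and the `exp` column are made-up names for illustration. The `index_col` argument tells Pandas which column to use as the index.

```python
import io

import pandas as pd

# Made-up CSV content; in practice this text would live in a .csv file
csv_text = "exp,Temp,Pres\nExp1,31.5,100.0\nExp2,32.5,110.0\n"

# io.StringIO lets read_csv treat the string like an open file
demo_df = pd.read_csv(io.StringIO(csv_text), index_col='exp')
print(demo_df)
```

With a real file, you would pass the file path (a string) instead of the `io.StringIO` object.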

### Accessing Data in DataFrames
    
In the following examples, we will be using the **temperature and pressure** data below.

In [6]:
# My data
temps = [31.5, 32.5, 30.2, 'err']
pressures = [100.0, 'err', 180.0, 190.0]
exps = ['Exp1','Exp2','Exp3','Exp4']

# Create a DataFrame with my data
d = {'Temp':temps, 'Pres':pressures}
exp_df = pd.DataFrame(data=d,index=exps)
exp_df

Unnamed: 0,Temp,Pres
Exp1,31.5,100.0
Exp2,32.5,err
Exp3,30.2,180.0
Exp4,err,190.0


### Accessing data using the `.loc[]` method

One reason DataFrames are great for data analysis is that **instead of using positional indices** (e.g., 0, 1, 2) to access your data, **you can use actual names of variables**, known as **labels**, instead.

There are a range of different Pandas DataFrames methods and syntax you can use to access and modify data in DataFrames.

However, the **most common and robust method is the `.loc[]` (“location”) method**, which accesses values by their labels. 

**Access a <u>column</u> of data**:

In [None]:
exp_df['Temp']

In [8]:
exp_df.loc[:,'Temp']

How would you access the **pressure data (`Pres`) for all the experiments**?

In [None]:
#### YOUR CODE HERE ####


**Access a <u>row</u> of data**:

In [None]:
exp_df.loc['Exp1']

How would you access the **measurements for Experiment 4 (`Exp4`)**?

In [None]:
#### YOUR CODE HERE ####


**Access a <u>single value</u>**:

In [33]:
exp_df.loc['Exp1','Temp']

How would you access the **error value in Experiment 2 (`Exp2`)**?

In [None]:
#### YOUR CODE HERE ####


### Accessing data using the `.iloc[]` method

Sometimes you might need to access data in your DataFrame by its **positional index**, perhaps because you don’t know its label. In this case, you can use the `.iloc[]` (“integer location”) method. This indexing is identical to how we index NumPy arrays.
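
One difference worth keeping in mind when slicing: `.iloc[]` follows the NumPy convention and excludes the end position, while a label slice with `.loc[]` includes the end label. The sketch below shows this on a small made-up DataFrame (`demo` and its labels are hypothetical).

```python
import pandas as pd

# Made-up DataFrame with four labelled rows
demo = pd.DataFrame({'a': [10, 20, 30, 40]}, index=['w', 'x', 'y', 'z'])

# Positional slice: rows at positions 0 and 1 (position 2 is excluded)
by_position = demo.iloc[0:2]

# Label slice: rows 'w', 'x', AND 'y' (the end label is included)
by_label = demo.loc['w':'y']

print(by_position)
print(by_label)
```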

**Access a <u>column</u> of data**:

In [38]:
exp_df.iloc[:,0]

How would you access the **second column of data**?

In [None]:
#### YOUR CODE HERE ####


**Access a <u>row</u> of data**:

In [37]:
exp_df.iloc[1]

How would you access the **third row of data**?

In [None]:
#### YOUR CODE HERE ####


**Access a <u>single value</u>**:

In [None]:
exp_df.iloc[0,0]

How would you access the **error value in Experiment 2 (`Exp2`)** using `iloc[]`?

In [None]:
#### YOUR CODE HERE ####


### Modifying Data in a DataFrame

You can change data within a DataFrame by simply specifying the data point(s) you want to change using the `.loc[]` method, and then setting that equal to the new data.

Note that, similar to NumPy arrays, the new data must be a compatible shape with the data you are replacing.

In [13]:
exp_df

Unnamed: 0,Temp,Pres
Exp1,31.5,100.0
Exp2,32.5,err
Exp3,30.2,180.0
Exp4,err,190.0


For example, let's say we want to replace the error in `Exp4` in the temperature and pressure dataset, `exp_df`.

We could do this by simply accessing the value via `.loc[]`, and then setting it equal to the value we want to replace it with:

In [16]:
exp_df.loc['Exp4','Temp'] = 30
exp_df

Unnamed: 0,Temp,Pres
Exp1,31.5,100.0
Exp2,32.5,err
Exp3,30.2,180.0
Exp4,30.0,190.0


We can also **replace multiple values** like the slices in NumPy arrays; however, note that the "slice" you replace the data with must be of the same size.

In [20]:
exp_df.loc[['Exp1','Exp2'],'Pres'] = [110,120]
exp_df

Unnamed: 0,Temp,Pres
Exp1,31.5,110.0
Exp2,32.5,120.0
Exp3,30.2,180.0
Exp4,30.0,190.0
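
To see what happens when the sizes do not match, here is a small sketch on a made-up DataFrame (`demo` and `size_mismatch_caught` are hypothetical names). It assumes that Pandas raises a `ValueError` when you try to assign three values to two selected rows.

```python
import pandas as pd

# Made-up pressure data for three experiments
demo = pd.DataFrame({'Pres': [100.0, 150.0, 180.0]},
                    index=['Exp1', 'Exp2', 'Exp3'])

# Two rows selected but three new values supplied: the sizes do not match
try:
    demo.loc[['Exp1', 'Exp2'], 'Pres'] = [110, 120, 130]
    size_mismatch_caught = False
except ValueError as err:
    size_mismatch_caught = True
    print('Could not assign:', err)

# Two rows selected and two new values supplied: this works
demo.loc[['Exp1', 'Exp2'], 'Pres'] = [110, 120]
print(demo)
```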


### <span style="color:blue"><b>Exercise 1:</b></span>

Recall the customer dataset we had at the beginning:

In [12]:
cust_df

Unnamed: 0,Gender,Age,Income USD,Education,Country
cust01,F,23,45000,MS,USA
cust02,M,34,36000,BS,England
cust03,NB,24,72000,BS,USA
cust04,F,65,32000,HS,China


Extract **customer 3's age** from the dataset.

In [None]:
#### YOUR CODE HERE ####


How would you access the **income** of the **last 3 customers** in the dataset?

In [None]:
#### YOUR CODE HERE ####


How would you **change customer 1's and customer 4's income** to 40000?

In [None]:
#### YOUR CODE HERE ####


## Methods and Properties for exploring Pandas DataFrames

When you first load in a dataset, you will want to know how it is formatted and what is in it. Pandas DataFrames have a range of useful properties and methods for figuring out what is in your dataset.

The syntax for using a method or property is your variable + a period + the method or property, with any relevant inputs to that method in parentheses. Note that properties don't have inputs, so you don't need the parentheses to call them. In the table below, the following methods are being performed on a DataFrame called `df`.

|Method/Property|Description|
|:-|-|
|`df.shape`|Get the shape (i.e., the dimensions; # of rows, columns) of the DataFrame
|`df.index`|See what's being used as the index (i.e., the vertical axis, or the labels for each of the rows)
|`df.columns`|See the column names (i.e., the horizontal axis, or the labels for each of the columns)
|`df.dtypes`|See the data types contained in each column of your DataFrame; if your column contains more than one data type, it will be labeled as an object.
|`df.head(n), df.tail(n)`|See the first (head) or last (tail) n rows in your DataFrame. If you do not specify n, it defaults to the first or last 5 rows.
|`df.info()`|See a general summary of your DataFrame by column, including data types, data points, and memory usage
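
The difference between a property (no parentheses) and a method (parentheses, with optional inputs) can be sketched on a small made-up DataFrame. The name `demo` and its values are hypothetical and only serve to show the syntax.

```python
import pandas as pd

# Made-up measurements for three experiments
demo = pd.DataFrame({'Temp': [31.5, 32.5, 30.2],
                     'Pres': [100.0, 110.0, 180.0]},
                    index=['Exp1', 'Exp2', 'Exp3'])

# Property: no parentheses
n_rows, n_cols = demo.shape

# Method: parentheses, here with the input n=2
first_two = demo.head(2)

print(n_rows, n_cols)
print(first_two)
```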

### Loading in CSV Data

In the following section, we will be **loading a CSV file as a Pandas DataFrame** and using some of the methods/properties in the table above to explore what is in the data.

The first thing you need to do when loading a dataset is to **find where it is located on your computer/machine**.

Our data file should be in the `data` folder you downloaded. For the following example, we will be using an example dataset called `mpg_data.csv`.

In [22]:
# Get the file path
fname = '../../data/mpg_data.csv'

We will load in the data using [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, since we are working with a CSV (`.csv`) file. However, you can see how to load other data formats [here](https://pandas.pydata.org/pandas-docs/stable/reference/io.html).

In [23]:
mpg_data = pd.read_csv(fname)

# To display the data:
mpg_data

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,make,model
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet,chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick,skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth,satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc,rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford,torino
...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,usa,ford,mustang gl
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw,pickup
395,32.0,4,135.0,84.0,2295,11.6,82,usa,dodge,rampage
396,28.0,4,120.0,79.0,2625,18.6,82,usa,ford,ranger


### Data Exploration

**A.** If you have a particularly large dataset and want to **see just a few rows**:
* Use the `.head(n)` method to see the first *n* rows
* Use the `.tail(n)` method to see the last *n* rows
* If you do not specify *n*, it defaults to the first or last 5 rows

In [82]:
#### YOUR CODE HERE ####


In [35]:
#### YOUR CODE HERE ####


**B.** If you want to get the **shape** of your DataFrame:
* Use the `.shape` property (no parentheses needed)
* Note that the shape does **not** include the index column.

In [68]:
#### YOUR CODE HERE ####


**C.** See what's being used as the **index** (i.e., the **vertical** axis, or the **labels for each of the <u>rows</u>**):

In [66]:
#### YOUR CODE HERE ####


**D.** See the **column names** (i.e., the **horizontal** axis, or the **labels for each of the <u>columns</u>**):

In [67]:
#### YOUR CODE HERE ####


**E.** If you want to know the **data type** in each column:

In [50]:
#### YOUR CODE HERE ####


**F.** See **general information** all at once, including memory usage:

In [49]:
#### YOUR CODE HERE ####


Using the methods and properties above, answer the following questions about `mpg_data`.

**1.** What are the **columns** in the dataset? How many columns are there?

**2.** What is being used as the **index** in the DataFrame?

**3.** What different **types of data** are present in the dataset?

**4.** How many **total data points** are there?

# Data Cleaning

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The truth about data science: cleaning your data is 90% of the work. Fitting the model is easy. Interpreting the results is the other 90%.</p>&mdash; Jake VanderPlas (@jakevdp) <a href="https://twitter.com/jakevdp/status/742406386525446144">June 13, 2016</a></blockquote>

Often, you will need to clean datasets that you have. This can involve removing certain data points that are problematic. The functions below can help you do this:

|Method/Property|Description|
|:-|-|
|`df.drop(label, axis=axis)`|Drop data by either their column name or index label. If you specify a column name, you have to set the `axis` argument to `axis=1`.
|`df.dropna(how=how,subset=[column name(s)])`|Drop data points for which the value is missing or NaN. You can specify whether you want it to drop rows for which all the column values are NaNs (`how='all'`), or for which any of the column values are NaNs (`how='any'`). You can also specify the columns you want to check for NaNs using the subset argument.
|`df.drop_duplicates(subset=[column name(s)])`|Drop data points that are duplicated. You can specify the specific columns you want it to check for duplicates using the subset argument.

## Dropping Data
    
Run the following cells to see how the different methods `drop()`, `dropna()`, and `drop_duplicates()` work on `df`.

In [71]:
df = pd.DataFrame([[4,2,7,3],[1,np.nan,6,2],[np.nan,np.nan,np.nan,np.nan],
                  [8,3,1,9],[np.nan,3,0,np.nan],[4,2,7,3]],
                 columns=['col1','col2','col3','col4'],
                 index=['idx1','idx2','idx3','idx4','idx5','idx6'])
df

Unnamed: 0,col1,col2,col3,col4
idx1,4.0,2.0,7.0,3.0
idx2,1.0,,6.0,2.0
idx3,,,,
idx4,8.0,3.0,1.0,9.0
idx5,,3.0,0.0,
idx6,4.0,2.0,7.0,3.0


### Using the `.drop()` method

If you **know the labels (e.g., the column and/or the row index)** of the data you want to drop, then you can use the **`.drop()` method**.

Sometimes you may want to drop **rows** of data:

* When cleaning data, it is very common to drop rows of data points that don't meet criteria for keeping. For example, some data points may be missing certain information in important columns, be outliers, or you may find some other reason a data point is invalid.

In [76]:
df.drop('idx2',axis=0)

* The **first argument (`'idx2'`)** in the `.drop()` method specifies the **label we want to drop**.
    * In this case, `'idx2'` signifies the index of the row we want to drop.
* The **second argument (`axis=0`)** specifies the **axis in which to look for the `'idx2'` label**. In this case, the `axis=0` says to look for the label in the row names (a `1` would tell it to look for the label in the columns).

You can also drop **columns** of data.

In [None]:
df.drop('col1',axis=1)

* The **first argument (`'col1'`)** in the `.drop()` method specifies the **label we want to drop**.
* The **second argument (`axis=1`)** specifies the **axis in which to look for the `'col1'` label**. In this case, the `1` says to look for the label in the column names (a `0` would tell it to look for the label in the row indexes).
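
Note that `.drop()` returns a new DataFrame and leaves the original untouched unless you either reassign the result or pass `inplace=True`. The sketch below illustrates this with a small made-up DataFrame (`demo`, `trimmed`, and the other variable names are hypothetical).

```python
import pandas as pd

# Made-up DataFrame with two columns
demo = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['idx1', 'idx2'])

# Without inplace=True, drop() returns a new DataFrame...
trimmed = demo.drop('col1', axis=1)
# ...and the original still has both columns
cols_after_plain_drop = list(demo.columns)

# With inplace=True, the original is modified and nothing is returned
result = demo.drop('col1', axis=1, inplace=True)
cols_after_inplace_drop = list(demo.columns)

print(cols_after_plain_drop)
print(cols_after_inplace_drop)
```

The same pattern applies to `.dropna()` and `.drop_duplicates()`.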

### Using the `.dropna()` method

Often, datasets will have missing data in certain rows or columns. However, you may not know exactly where these are or what their labels are. The `.dropna()` method is useful because it can find these null/NaN values for you!

In [None]:
df.dropna(how='any')

In [None]:
df.dropna(how='all')

* The argument **`how='any'`** in the `.dropna()` method tells Pandas to **look for rows for which there are _any_ null values**.
    * This means that if a row has even a single column with missing data, it will be dropped.
* You can also specify **`how='all'`**. This argument tells Pandas to **look for rows for which _all_ the columns have null values**.
    * This means that _all_ the columns in a row must be missing data in order for it to be dropped.

In [None]:
df.dropna(how='any',subset=['col2','col3'])

In [None]:
df.dropna(how='all',subset=['col2','col3'])

* The argument **`subset=['col2','col3']`** in the `.dropna()` method tells Pandas to look for rows with null values but **_only within columns `col2` and `col3`_**.
    * For `how='any'`, this means that if a row has a null value in **either** `col2` or `col3` it will be dropped.
    * For `how='all'`, this means that if a row has a null value in **both** `col2` and `col3` it will be dropped.

### Using the `.drop_duplicates()` method

Often in data analysis you will want to drop duplicates in your dataset. However, similar to the null values, you may not know where they are or what their labels are. The `drop_duplicates()` method is useful because it can find duplicates for you!

In [None]:
df.drop_duplicates()

In [None]:
df.drop_duplicates(subset=['col1'])

* The argument **`subset=['col1']`** in the `.drop_duplicates()` method tells Pandas to look for rows with duplicate values but **_only within column `col1`_**. If you don't specify a subset, it will drop rows that are duplicated across all columns, keeping the first occurrence.
    * Note that NaNs count as duplicates of one another.
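
As a small illustration of how NaNs are treated, the sketch below uses a made-up DataFrame (`demo`, with hypothetical columns `k` and `v`) in which column `k` contains two NaNs and a repeated value. Only the first occurrence of each value in `k` should be kept.

```python
import numpy as np
import pandas as pd

# Made-up data: column 'k' has a repeated 1.0 and two NaNs
demo = pd.DataFrame({'k': [1.0, np.nan, np.nan, 1.0],
                     'v': [10, 20, 30, 40]})

# Only look at column 'k' when deciding what counts as a duplicate
deduped = demo.drop_duplicates(subset=['k'])
print(deduped)
```

Here the second NaN and the second `1.0` are dropped, even though their `v` values differ from the rows that were kept.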

### <span style="color:blue">Exercise 2:</span>

Consider the DataFrame `df` above:

||**col1**|**col2**|**col3**|**col4**|
|-|-|-|-|-|
|**idx1**|4|2|7|3|
|**idx2**|1|NaN|6|2|
|**idx3**|NaN|NaN|NaN|NaN|
|**idx4**|8|3|1|9|
|**idx5**|NaN|3|0|NaN|
|**idx6**|4|2|7|3|

**Using `dropna()`, modify `df` so that it looks like:**

||**col1**|**col2**|**col3**|**col4**|
|-|-|-|-|-|
|**idx1**|4|2|7|3|
|**idx4**|8|3|1|9|
|**idx5**|NaN|3|0|NaN|
|**idx6**|4|2|7|3|

In [None]:
#### YOUR CODE HERE ####


**Using `drop_duplicates()`, modify `df` so that it looks like:**

||**col1**|**col2**|**col3**|**col4**|
|-|-|-|-|-|
|**idx1**|4|2|7|3|
|**idx2**|1|NaN|6|2|
|**idx4**|8|3|1|9|

In [None]:
#### YOUR CODE HERE ####


# Methods for Analyzing Data

When working with DataFrames, we may want to **compute statistics** on our data, such as:
* `df.mean()`
* `df.median()`
* `df.quantile()`
* `df.std()`
* `df.sum()`

When calling these methods on a DataFrame, it will **calculate these statistics for each column** (where possible).
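
In recent Pandas versions (2.0 and later, as far as we are aware), calling a statistic such as `.mean()` on a DataFrame that also contains text columns raises an error instead of skipping those columns. Passing `numeric_only=True` restricts the calculation to numeric columns. The sketch below shows this on a small made-up DataFrame (`cars` and its values are hypothetical).

```python
import pandas as pd

# Made-up data with two numeric columns and one text column
cars = pd.DataFrame({'mpg': [18.0, 15.0, 27.0],
                     'cylinders': [8, 8, 4],
                     'origin': ['usa', 'usa', 'europe']})

# numeric_only=True skips the text column 'origin'
col_means = cars.mean(numeric_only=True)
print(col_means)
```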

Run the cells below to see how the methods work.

In [117]:
mpg_data.head()

Unnamed: 0.1,Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,make,model
0,0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet,chevelle malibu
1,1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick,skylark 320
2,2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth,satellite
3,3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc,rebel sst
4,4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford,torino


In [89]:
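mpg_data.mean(numeric_only=True)  # numeric_only=True skips text columns such as origin

In [None]:
mpg_data.median(numeric_only=True)

In [92]:
mpg_data.quantile(0.9, numeric_only=True)

In [96]:
mpg_data.std(numeric_only=True)

In [94]:
mpg_data.sum(numeric_only=True)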

However, more often than not, we will want to compute these statistics on subsets of our datasets, instead of the whole thing.

For example, what if I wanted to look at the **mean mpg for cars from each origin**? The section below will cover how to do this.

### Boolean Masking

When analyzing data, often it is useful to extract certain data points that have properties in common with one another.

For example, consider the columns in `mpg_data`.

In [118]:
mpg_data.head()

Unnamed: 0.1,Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,make,model
0,0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet,chevelle malibu
1,1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick,skylark 320
2,2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth,satellite
3,3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc,rebel sst
4,4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford,torino


In this dataset, we might be interested in looking at just the **data for cars whose `origin` is `'europe'`**. But how can we extract this data?

Recall that we used the **`.loc[]`** method to extract **specific rows** of data. We can use `.loc[]`, along with a **boolean mask**, to filter out this data.

A **boolean mask** is just an array of `True` and `False`, where the locations of the `True` correspond to the data you want to extract, and the `False` correspond to the locations of the data you want to mask out.

A simple example:

Consider an array, `x`.

In [80]:
x = np.array([1,3,-3,2,8,4])

Suppose we wanted to extract all values **less than 3**. The boolean mask would then be:

In [81]:
x<3

array([ True, False,  True,  True, False, False])

Notice how the **boolean mask has the exact same shape as the data we are masking**. To apply the mask, we then type:

In [82]:
x[x<3]

array([ 1, -3,  2])

We can similarly create a boolean mask for a DataFrame and apply it via `loc[]`. So a **mask** for the **data points with an `origin` equal to `'europe'`** would look like:

In [24]:
mpg_data['origin']=='europe'

0      False
1      False
2      False
3      False
4      False
       ...  
393    False
394     True
395    False
396    False
397    False
Name: origin, Length: 398, dtype: bool

Then to apply the mask, we use `loc[]` like so:

In [87]:
mpg_data.loc[mpg_data['origin']=='europe']

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
19,26.0,4,97.0,46.0,1835,20.5,70,europe,volkswagen 1131 deluxe sedan
20,25.0,4,110.0,87.0,2672,17.5,70,europe,peugeot 504
21,24.0,4,107.0,90.0,2430,14.5,70,europe,audi 100 ls
22,25.0,4,104.0,95.0,2375,17.5,70,europe,saab 99e
23,26.0,4,121.0,113.0,2234,12.5,70,europe,bmw 2002
...,...,...,...,...,...,...,...,...,...
354,34.5,4,100.0,,2320,15.8,81,europe,renault 18i
359,28.1,4,141.0,80.0,3230,20.4,81,europe,peugeot 505s turbo diesel
360,30.7,6,145.0,76.0,3160,19.6,81,europe,volvo diesel
375,36.0,4,105.0,74.0,1980,15.3,82,europe,volkswagen rabbit l


Now that we've extracted the EU car data, we can look at the statistics for just these cars:

In [101]:
EU_mpg_data = mpg_data.loc[mpg_data['origin']=='europe']

EU_mpg_data.mean(numeric_only=True)

mpg               27.891429
cylinders          4.157143
displacement     109.142857
horsepower        80.558824
weight          2423.300000
acceleration      16.787143
model_year        75.814286
dtype: float64
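
Boolean masks can also be combined. The sketch below, on a small made-up DataFrame (`cars` and its values are hypothetical), uses `&` ("and") to keep only rows that satisfy two conditions at once. Each condition needs its own parentheses, and `|` can be used in the same way for "or".

```python
import pandas as pd

# Made-up data for four cars
cars = pd.DataFrame({'mpg': [18.0, 26.0, 24.0, 31.0],
                     'origin': ['usa', 'europe', 'europe', 'japan']})

# True only where BOTH conditions hold
mask = (cars['origin'] == 'europe') & (cars['mpg'] > 25)

efficient_eu = cars.loc[mask]
print(efficient_eu)
```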

What code would you write to compute the **mean mpg** for all **4-cylinder** cars?

In [None]:
#### YOUR CODE HERE ####


But what if we wanted to compute statistics for multiple groups of data within our DataFrame? Thankfully, Pandas has a function just for that! Introducing...

### Using the **`groupby()`** Method

`groupby()` is a method that allows you to split your data into groups. It works by grouping together the rows that share the same value in the column(s) you specify.

You can then perform a statistical operation - e.g., taking the mean, standard deviation, or simply counting - on each group.

For example, say you wanted to calculate the **mean** of each numeric column for the cars **from each different origin**. The `groupby()` method will identify all the unique origins, and then it will calculate the mean within each origin:

In [99]:
mpg_data.groupby(['origin']).mean(numeric_only=True)

Unnamed: 0_level_0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
origin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
europe,27.891429,4.157143,109.142857,80.558824,2423.3,16.787143,75.814286
japan,30.450633,4.101266,102.708861,79.835443,2221.227848,16.172152,77.443038
usa,20.083534,6.248996,245.901606,119.04898,3361.931727,15.033735,75.610442


* In this line of code, the `['origin']` indicates that we will be grouping the data by all the unique values it can find in that column.
* The `.mean()` then indicates that for each group of data, we will calculate the mean for each column (where this is possible).
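
If you are only interested in one column, you can select it after grouping and before computing the statistic. The sketch below does this on a small made-up DataFrame (`cars` and its values are hypothetical) to get the mean `mpg` per `origin`.

```python
import pandas as pd

# Made-up data for five cars
cars = pd.DataFrame({'mpg': [18.0, 15.0, 26.0, 24.0, 30.0],
                     'origin': ['usa', 'usa', 'europe', 'europe', 'japan']})

# Group by origin, keep only the 'mpg' column, then take the mean per group
mean_mpg = cars.groupby('origin')['mpg'].mean()
print(mean_mpg)
```

The result is a Series indexed by the group labels, so a single group's value can be looked up with, e.g., `mean_mpg['usa']`.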

Often, it is useful just to see how many data points there are per group. You can do this using the `.size()` method:

In [97]:
mpg_data.groupby(['origin']).size()

origin
europe     70
japan      79
usa       249
dtype: int64

**Group the data by the `make` of the car** and **count how many data points** there are per `make`. Then, calculate the **median** for each `make`.

In [None]:
#### YOUR CODE HERE ####
mpg_data.groupby()


Finally, note that you can also **group things by multiple columns**.

In the code below, the first part (`.groupby(['origin','make'])`) **groups everything first by `origin`** (`europe`,`japan`,`usa`), and then it **breaks those groups into smaller sub-groups** by the **`make`**.

The `.size()` part then tells it to compute the number of data points in each of the sub-groups.

In [127]:
mpg_data.groupby(['origin','make']).size()

# Practical: Working with the Titanic Survivors Dataset

This dataset is very commonly used to teach supervised machine learning - it has many columns that can be used as predictors but requires some cleaning. Perfect for applying some of the skills we just learned!

First, we will **load the dataset into our notebook**. Follow the instructions in the worksheet to complete this step.

In [124]:
ts_df = pd.read_csv('')

In [125]:
ts_df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


### Data Exploration

Print the dataset and take a look at the data in it. You may look back to the Methods and Properties for exploring Pandas DataFrames section for some helpful methods.

In [None]:
ts_df

### Data Cleaning

Note: make sure to set `inplace=True` for the following lines of code.

It looks like the `sex`, `who`, and `adult_male` columns all tell us the same information, so let's get rid of the `who` and `adult_male` columns. What code would you write to drop these columns?

You decide that the **age, the survival status, and class** are important variables for your analysis, so you'd like to **get rid of the data points that are missing _any_ of these variables (i.e., a NaN)**. What code would you write to get rid of these rows of data?

Just to be sure your dataset is robust, **get rid of any duplicates** in your dataset.

### Data Analysis

You want to know **how many people survived** the Titanic shipwreck in your dataset. What code could you write to do this?
_Hint: think about which column(s) you need for this._

You want to know **how many people who survived the Titanic shipwreck were under the age of 30**. What code could you write to do this?

_Hint: you can use `groupby()` or boolean masks to do this._

You want to know **what proportion of people who survived the Titanic shipwreck were male or female**. What code could you write to do this?

## Exporting data

A very useful feature of Pandas is the ability to write and export data to a `.csv` (comma-separated values) file that can be read easily by programs like Excel:

In [80]:
# Assumes data_dir is a pathlib.Path pointing to your data folder
ts_df.to_csv(data_dir/'titanic_cleaned.csv', index=False)
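
To see what `index=False` does, the sketch below writes a small made-up DataFrame to a temporary folder twice, once with and once without the index, and reads both files back. The names `demo`, `with_index`, and `without_index` are hypothetical. When the index is written out, it comes back as an extra column called `Unnamed: 0`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data with the default 0, 1, ... index
demo = pd.DataFrame({'age': [22.0, 38.0], 'survived': [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / 'with_index.csv'
    without_index = Path(tmp) / 'without_index.csv'

    demo.to_csv(with_index)                  # index is written as the first column
    demo.to_csv(without_index, index=False)  # index is left out

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

If your index carries meaningful labels, you may instead want to keep it and pass `index_col=0` to `pd.read_csv()` when reading the file back.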

# Other useful Pandas Operations

More examples: https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html

## Extra tutorials

Online tutorials with more in-depth operations used in pandas:

* [Kaggle tutorial](https://www.kaggle.com/learn/pandas)
* [Pandas official website Getting Started](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)

## References
* https://github.com/jonathanrocher/pandas_tutorial
* https://github.com/koldunovn/python_for_geosciences
* http://pandas.pydata.org/pandas-docs/stable/index.html#module-pandas
* http://pandas.pydata.org/pandas-docs/stable/10min.html
* https://towardsdatascience.com/linear-regression-in-6-lines-of-python-5e1d0cd05b8d