<center> <img src="res/ds3000.png"> </center>

<center> <h1> Week 6 - Day 1 </h1> </center>

<center> <h2> Part 2: Data Wrangling Cont'd </h2></center>

## Outline
1. <a href='#1'>Multilevel/Hierarchical Indexing</a>
2. <a href='#2'>Working with Missing Values in DataFrames</a>
3. <a href='#3'>Working with Duplicates</a>
4. <a href='#4'>Replacing Values</a>
5. <a href='#5'>Renaming Axis Indices</a>

<a id="1"></a>

## 1. Multilevel/Hierarchical Indexing
* A DataFrame can be indexed by two or more columns at once
* Use the **set_index** method and pass in a **list of column names**

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("res/ave_grades.csv")

In [None]:
df

In [None]:
df = df.set_index("Student")

In [None]:
df

In [None]:
df.loc["Hermione Granger"]

#### reset_index() method
* resets the index
* promotes the current index into a column
* creates a default numbered index


In [None]:
df = df.reset_index()

In [None]:
df

In [None]:
#set multiple indices
df = df.set_index(["House", "Student"])

In [None]:
df

In [None]:
df = df.sort_index()

In [None]:
df

* A multilevel index is a **MultiIndex**; each row label is a tuple with one element per level:

In [None]:
df.index

In [None]:
df.index.values
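To see the tuple labels without the course file, here is a minimal sketch. The small `grades` frame is a hypothetical stand-in for `res/ave_grades.csv`, and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for res/ave_grades.csv (values are made up)
grades = pd.DataFrame({
    "House": ["Gryffindor", "Slytherin", "Gryffindor"],
    "Student": ["Harry Potter", "Draco Malfoy", "Hermione Granger"],
    "Potion_Ave": [70, 85, 95],
})

# Promote two columns to the index, then sort by (House, Student)
grades = grades.set_index(["House", "Student"]).sort_index()

# Each row label is a (House, Student) tuple
print(list(grades.index))

# The level names come from the columns that were promoted
print(list(grades.index.names))
```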

### 1.1. Selecting Rows from a Hierarchically-Indexed DataFrame

In [None]:
df.loc["Gryffindor"]

In [None]:
df.loc["Gryffindor", "Harry Potter"]

In [None]:
df.loc["Gryffindor"].describe()

In [None]:
df.loc["Slytherin"].mean()

#### How would you modify the code snippet above to retrieve the mean score for Slytherin on Potion_Ave?

In [None]:
df.loc["Slytherin", "Potion_Ave"].mean()
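An equivalent two-step form selects the house first and the column second, which some readers find easier to follow. The sketch below uses a made-up `grades` frame (a hypothetical stand-in for the course data) and also shows `groupby(level=...)` as one possible way to get the mean for every house at once.

```python
import pandas as pd

# Hypothetical stand-in for res/ave_grades.csv (names and scores are made up)
grades = pd.DataFrame({
    "House": ["Gryffindor", "Gryffindor", "Slytherin", "Slytherin"],
    "Student": ["Harry Potter", "Hermione Granger",
                "Draco Malfoy", "Pansy Parkinson"],
    "Potion_Ave": [70, 95, 85, 75],
}).set_index(["House", "Student"]).sort_index()

# Select the house, then the column, then aggregate
slytherin_mean = grades.loc["Slytherin"]["Potion_Ave"].mean()
print(slytherin_mean)

# Mean of the same column for every house, grouped on the outer index level
house_means = grades.groupby(level="House")["Potion_Ave"].mean()
print(house_means)
```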

### 1.2. Selecting Multiple Rows
* When selecting multiple rows, provide a list of tuples corresponding to each row
    * e.g., if selecting two rows, provide a list of two tuples

In [None]:
df.loc[[("Gryffindor", "Harry Potter"), ("Slytherin", "Draco Malfoy")]]

In [None]:
df

### 1.3. Selecting Multiple Rows with Specific Columns
* Specify the list of columns as a second argument to **.loc[rows, columns]**

In [None]:
df.loc[[("Gryffindor", "Harry Potter"), ("Slytherin", "Draco Malfoy")], ["Potion_Ave"]]
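The selection patterns from Sections 1.1 to 1.3 can be tried side by side on a small frame. This is only a sketch: the data and the `Charms_Ave` column are hypothetical and not taken from the course file.

```python
import pandas as pd

# Hypothetical data; Charms_Ave is an invented second column
grades = pd.DataFrame({
    "House": ["Gryffindor", "Gryffindor", "Slytherin"],
    "Student": ["Harry Potter", "Hermione Granger", "Draco Malfoy"],
    "Potion_Ave": [70, 95, 85],
    "Charms_Ave": [80, 98, 72],
}).set_index(["House", "Student"]).sort_index()

# Outer level only: a DataFrame indexed by Student
gryffindor = grades.loc["Gryffindor"]

# Full (House, Student) tuple: a single row, returned as a Series
harry = grades.loc[("Gryffindor", "Harry Potter")]

# List of tuples for the rows, list of names for the columns
subset = grades.loc[
    [("Gryffindor", "Harry Potter"), ("Slytherin", "Draco Malfoy")],
    ["Potion_Ave"],
]
print(subset)
```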

<a id="2"></a>

## 2. Working with Missing Values in DataFrames
* Real-life datasets often contain missing values
* Need to do something about those missing values before analyzing the data

In [None]:
dada = pd.read_csv("res/DADA.csv")

In [None]:
dada

### Pandas and Missing Data
* Pandas automatically assigns **NaN**, "Not a Number", to missing values while reading a file
* By default, empty cells are considered missing values and are therefore assigned NaN
* If the file marks missing data with a sentinel value (e.g., -99), you can tell pandas to treat that value as missing
    * use the **na_values** keyword argument

In [None]:
pd.read_csv("res/DADA_NA.csv")

In [None]:
pd.read_csv("res/DADA_NA.csv", na_values = "-99")
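The effect of **na_values** can be checked without the course file by reading CSV text from memory. The sketch below assumes a made-up two-row file in which -99 marks a missing score.

```python
import io
import pandas as pd

# Made-up CSV text: one empty cell and one -99 sentinel
csv_text = (
    "Student,Expelliarmus,Stupefy\n"
    "Harry Potter,90,-99\n"
    "Ron Weasley,,75\n"
)

# Default behaviour: only the empty cell becomes NaN
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, the -99 sentinel becomes NaN as well
marked = pd.read_csv(io.StringIO(csv_text), na_values="-99")

print(raw.isna().sum().sum())
print(marked.isna().sum().sum())
```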

### What to do with missing data?
* Filter out missing data
* Fill in missing data
* Ignore missing data

### 2.1. Filtering Out Missing Data
* The dropna() method allows you to drop rows and columns with missing data
* By default, dropna() drops any row containing a missing value

In [None]:
dada

In [None]:
dada_clean = dada.dropna()

In [None]:
dada_clean

In [None]:
dada

### 2.1.1. Filtering out empty rows
It appears that Row 12 is in the dataset by mistake.

**dropna(how = "all")** will only drop rows that are all NaN.
* If you want to drop columns in the same way, pass **axis=1**: **df.dropna(how="all", axis=1)**

In [None]:
dada = dada.dropna(how="all")

In [None]:
dada
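The difference between the default `dropna()` and `how="all"` is easiest to see on a tiny frame. The `scores` data below is hypothetical, with one fully empty row and one fully empty column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 and the "Notes" column are entirely NaN
scores = pd.DataFrame({
    "Student": ["Harry Potter", np.nan, "Ron Weasley"],
    "Stupefy": [80.0, np.nan, np.nan],
    "Notes": [np.nan, np.nan, np.nan],
})

# Drop only the rows in which every value is missing
rows_kept = scores.dropna(how="all")

# Drop only the columns in which every value is missing
cols_kept = scores.dropna(how="all", axis=1)

# Default (how="any"): every row here has at least one NaN, so none survive
strict = scores.dropna()

print(rows_kept)
print(cols_kept)
```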

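* Completely blank lines in the file are skipped at load time because **skip_blank_lines=True** is the default for **read_csv**; a row that contains only delimiters is still read as an all-NaN row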

In [None]:
pd.read_csv("res/DADA_NA.csv", skip_blank_lines=True) 

### 2.1.2. Specifying Number of Missing Values Before Filtering Out
* The **thresh** keyword argument specifies the minimum number of non-NaN values a row must have in order to be kept
* **dropna(thresh=2)** means all rows containing fewer than 2 non-NaN values will be filtered out, or dropped.

* Suppose Professor Lupin has decided to calculate the midterm grade if students have performed five of the six spells.

In [None]:
dada.dropna(thresh=5)
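To make the counting rule concrete, the sketch below uses a hypothetical three-row frame and compares each row's number of non-NaN values with the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: rows have 3, 2, and 1 non-NaN values respectively
scores = pd.DataFrame({
    "Expelliarmus": [90, 85, np.nan],
    "Stupefy": [80, np.nan, np.nan],
    "Protego": [70, 65, 60],
})

# Number of non-NaN values per row, which is what thresh is compared against
non_missing = scores.notna().sum(axis=1)
print(non_missing.tolist())

# Keep rows with at least 2 non-NaN values
kept = scores.dropna(thresh=2)
print(kept)
```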

### 2.2. Filling In Missing Data
* Instead of filtering out missing data, you can fill in missing data with some other values.
* The **fillna()** method allows you to fill in missing data with specific values

In [None]:
dada.fillna(0)

### Caution
* fillna returns a new DataFrame
* If you want to update the original DataFrame, set inplace=True

```python
dada.fillna(0, inplace=True)
```
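A quick way to confirm that `fillna` leaves the original untouched is to compare both objects afterwards. The one-column frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame with a single missing score
original = pd.DataFrame({"Stupefy": [80.0, np.nan]})

# fillna returns a new DataFrame
filled = original.fillna(0)

print(original.isna().sum().sum())  # the original still has its NaN
print(filled["Stupefy"].tolist())

# Reassigning the result (original = original.fillna(0)) is a common
# alternative to inplace=True when the original should be updated.
```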

### 2.2.1. Filling In a Different Value for Each Column
* Can fill in each column with a different value
* Pass in a dictionary containing column names as keys and the desired fill values as values
    * **fillna(dictionary_of_values)**

In [None]:
default_col_values = {"Expelliarmus": 0, "Stupefy": 0, "Protego": 0, 
                      "Accio": 0, "Petrificus Totalus": 0, "Expecto Patronum": ""}

In [None]:
dada.fillna(default_col_values)
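Columns that are not listed in the dictionary are left as they are. The sketch below illustrates this with a hypothetical frame in which `Accio` is deliberately omitted from the mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; "Accio" is not included in the fill mapping
scores = pd.DataFrame({
    "Stupefy": [80.0, np.nan],
    "Expecto Patronum": ["stag", np.nan],
    "Accio": [np.nan, 55.0],
})

defaults = {"Stupefy": 0, "Expecto Patronum": ""}
filled = scores.fillna(defaults)

print(filled)
```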

<a id="3"></a>

## 3. Working with Duplicates
* Sometimes datasets contain duplicate records, which you may want to discard

In [None]:
dada

In [None]:
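# DataFrame.append was removed in pandas 2.0; pd.concat adds a copy of the first row instead
dada = pd.concat([dada, dada.iloc[[0]]])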

In [None]:
dada

### 3.1. duplicated() method
Returns a boolean Series indicating whether each row is a duplicate (i.e., whether the same values have been observed in a previous row)

In [None]:
dada.duplicated()
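The flags are based on column values, not on index labels. A minimal sketch with hypothetical data, which also shows the `keep=False` option for flagging every member of a duplicate group:

```python
import pandas as pd

# Hypothetical frame; a copy of the first row is appended to create a duplicate
scores = pd.DataFrame({
    "Student": ["Harry Potter", "Ron Weasley"],
    "Stupefy": [80, 75],
})
with_dup = pd.concat([scores, scores.iloc[[0]]])

# Only the later occurrence is flagged by default (keep="first")
flags = with_dup.duplicated()
print(flags.tolist())

# keep=False flags every row that has a duplicate somewhere
all_flags = with_dup.duplicated(keep=False)
print(all_flags.tolist())
```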

### 3.2. drop_duplicates() method
* Returns a DataFrame where the duplicates have been dropped

In [None]:
dada.drop_duplicates()

* You can also specify a subset of columns to detect duplicates.
* Pass in a list of column names as an argument to the **drop_duplicates()** method call.

In [None]:
dada

In [None]:
dada.drop_duplicates(["Student"])

* By default, both duplicated() and drop_duplicates() keep the first observed value combination. 
* If you want to keep the last value, pass in **keep="last"**

In [None]:
dada.drop_duplicates(["Student"], keep="last")
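The interaction between the column subset and `keep` can be checked on a small hypothetical frame in which one student appears twice with different scores.

```python
import pandas as pd

# Hypothetical data: "Harry Potter" appears twice with different scores
scores = pd.DataFrame({
    "Student": ["Harry Potter", "Ron Weasley", "Harry Potter"],
    "Stupefy": [80, 75, 95],
})

# No two rows match on every column, so nothing is dropped
full = scores.drop_duplicates()

# Compare on "Student" only: the first Harry row is kept
first = scores.drop_duplicates(subset=["Student"])

# Same comparison, but the last Harry row is kept
last = scores.drop_duplicates(subset=["Student"], keep="last")

print(first)
print(last)
```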

<a id="4"></a>

## 4. Replacing Values
* Beyond filtering out or filling in missing data, we can replace certain values with new values.
* **df.replace(current_value, new_value)**

In [None]:
dada_NA = pd.read_csv("res/DADA_NA.csv")

In [None]:
dada_NA

In [None]:
dada_NA.replace(-99, "NaN")  # "NaN" is a plain string here; for a true missing value use np.nan (requires import numpy as np)

In [None]:
dada_NA

In [None]:
dada_NA.replace("Drace Malfoy", "Draco Malfoy", inplace=True)

In [None]:
dada_NA

* If you want to replace multiple values at once, pass a list and then the substitute value(s)

```python
dada_NA.replace([-99, -999], "NaN")
dada_NA.replace([-99, -999], ["NaN", 0])
dada_NA.replace({-99: "NaN", -999: 0})
```
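Note that the string `"NaN"` is not recognized as missing by pandas, whereas `np.nan` is. The sketch below contrasts the two on a hypothetical frame in which -99 and -999 are assumed to be sentinel codes.

```python
import numpy as np
import pandas as pd

# Hypothetical scores using -99 and -999 as sentinel codes
scores = pd.DataFrame({
    "Stupefy": [80, -99, -999],
    "Accio": [-99, 60, 70],
})

# Replacing with the string "NaN": isna() does not see these cells as missing
as_text = scores.replace(-99, "NaN")

# Replacing a list of values with np.nan yields true missing values
as_missing = scores.replace([-99, -999], np.nan)

# A dictionary maps each old value to its own substitute
mapped = scores.replace({-99: np.nan, -999: 0})

print(as_text.isna().sum().sum())
print(as_missing.isna().sum().sum())
print(mapped)
```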

<a id="5"></a>

## 5. Renaming Axis Indices
* Can rename rows and columns using the **rename()** method
* Pass **inplace=True** if you want to modify the original df

In [None]:
dada_NA = dada_NA.set_index("Student")

In [None]:
dada_NA

In [None]:
dada_NA = dada_NA.rename(index={"Harry Potter": "Harry", "Hermione Granger": "Hermione", "Ron Weasley": "Ron"})

In [None]:
dada_NA

In [None]:
dada_NA = dada_NA.rename(columns={"Petrificus Totalus": "Full-Body Bind"})

In [None]:
dada_NA
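Both axes can be renamed in a single call, and labels missing from the mapping are left unchanged. The sketch below uses a hypothetical two-student frame and also shows that `rename` accepts a function instead of a dictionary.

```python
import pandas as pd

# Hypothetical frame indexed by student name
scores = pd.DataFrame(
    {"Petrificus Totalus": [70, 65]},
    index=["Harry Potter", "Ron Weasley"],
)

# Rename one row label and one column label in a single call
renamed = scores.rename(
    index={"Harry Potter": "Harry"},
    columns={"Petrificus Totalus": "Full-Body Bind"},
)

# A function is applied to every label on that axis
first_names = scores.rename(index=lambda name: name.split()[0])

print(renamed)
print(first_names)
```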