<center> <img src="res/ds3000.png"> </center>

<center> <h1> Week 6 - Day 1 </h1> </center>

<center> <h2> Part 1: Data Wrangling </h2></center>

## Outline
1. <a href='#1'>DataFrame Columns</a>
2. <a href='#2'>Sorting a DataFrame</a>
3. <a href='#3'>Querying a DataFrame</a>
4. <a href='#4'>Data Cleaning Example</a>

## Data Wrangling
* Data does not always come in forms ready for analysis
* Data could be
    * in the wrong format
    * incorrect 
    * missing
* Data scientists can spend as much as 75% of their time preparing data before they begin their studies
* This preparation work is called **data munging** or **data wrangling**

<a id="1"></a>

## 1. DataFrame Columns

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("res/hp_grades.csv")

In [3]:
df

In [None]:
df = df.set_index("Student")

In [4]:
df

### 1.1. Renaming Columns

In [None]:
df = df.rename(columns={"PTest1": "Potion1", "PTest2": "Potion2", "CTest1": "Charm1", "CTest2": "Charm2"})

In [None]:
df
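
The file `res/hp_grades.csv` is not reproduced here, so the following is only a minimal sketch on a made-up two-student table (the student names and scores are illustrative assumptions). It shows that `rename` returns a new `DataFrame` and leaves the original labels untouched unless the result is assigned back.

```python
import pandas as pd

# Hypothetical stand-in for the grades table; values are invented
grades = pd.DataFrame(
    {"PTest1": [90, 70], "CTest1": [85, 95]},
    index=["Student A", "Student B"],
)

# rename returns a new DataFrame with the mapped column labels
renamed = grades.rename(columns={"PTest1": "Potion1", "CTest1": "Charm1"})

print(list(grades.columns))   # original labels are unchanged
print(list(renamed.columns))  # new labels
```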

### 1.2. Adding and Computing with Columns

In [None]:
df["Potion_Ave"] = (df["Potion1"] + df["Potion2"]) / 2

In [None]:
df

In [None]:
df["Charm_Ave"] = (df["Charm1"] + df["Charm2"]) / 2

In [None]:
df
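
As an alternative sketch, the same kind of average can be computed with `mean(axis=1)` over a list of columns, which scales better when there are more than two tests. The table below is made up for illustration. One difference worth knowing: `mean` skips missing values by default, whereas adding the columns with `+` propagates `NaN`.

```python
import pandas as pd

# Hypothetical scores; not taken from the course data file
scores = pd.DataFrame(
    {"Potion1": [90, 70], "Potion2": [80, 75]},
    index=["Student A", "Student B"],
)

# Row-wise mean across the selected columns
scores["Potion_Ave"] = scores[["Potion1", "Potion2"]].mean(axis=1)

print(scores)
```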

In [None]:
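# Use the full option name; the short form "precision" is ambiguous in recent pandas versions
pd.set_option("display.precision", 2)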

In [None]:
df

### 1.3. unique() method
* Returns an array of the unique values in a column

In [None]:
df["House"].unique()
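
A small sketch of `unique()` on a made-up column (the house labels are placeholders, not the real data). The values come back in order of first appearance, and the related `nunique()` method returns just the count.

```python
import pandas as pd

# Placeholder house labels, invented for this example
houses = pd.Series(["H1", "H2", "H1", "H3", "H2"])

distinct = list(houses.unique())   # distinct values, in order of first appearance
how_many = houses.nunique()        # number of distinct values

print(distinct)
print(how_many)
```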

<a id="2"></a>

## 2. Sorting a DataFrame
* Can sort a `DataFrame` by its rows or columns, based on their indices or values

### 2.1. Sorting by Indices
* **sort_index()** method
    * Does not modify the original dataframe
    * By default, sorts in ascending order
* Sort the rows by their *indices* in _descending_ order using **`sort_index`** and its keyword argument `ascending=False` 

In [None]:
df.sort_index()

In [None]:
df.sort_index(ascending=False)

### 2.2. Sorting by Column Indices
* Sort columns into ascending order (left-to-right) by their column names
* **`axis=1` keyword argument** indicates that we wish to sort the _column_ indices, rather than the row indices
    * `axis=0` (the default) sorts the _row_ indices

In [None]:
df.sort_index(axis=1)
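
To make the difference between the two axes concrete, here is a minimal sketch on a tiny made-up `DataFrame` whose row and column labels are both out of order.

```python
import pandas as pd

# Tiny illustrative frame with unsorted row and column labels
demo = pd.DataFrame({"b": [1, 2], "a": [3, 4]}, index=["y", "x"])

by_rows = demo.sort_index()         # axis=0 (default): reorder rows by row label
by_cols = demo.sort_index(axis=1)   # axis=1: reorder columns by column name

print(by_rows)
print(by_cols)
```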

### 2.3. Sorting by Column Values
* To list the students from highest to lowest average grade, call method **`sort_values`** with `ascending=False`
* `by` and `axis` arguments work together to determine which values will be sorted
    * Here `by` names a column (`Charm_Ave` or `Potion_Ave`) and `axis` keeps its default of `0`, so the rows are reordered according to that column's values
* Does not modify the original DataFrame

In [None]:
df.sort_values(by="Charm_Ave", ascending=False)

In [None]:
df.sort_values(by="Potion_Ave", ascending=False)

In [None]:
df

* Can sort by multiple columns
* Just pass in a list of column names

In [None]:
df.sort_values(by=["Charm_Ave", "Potion_Ave"], ascending=False)

In [None]:
df.sort_values(by=["Potion_Ave", "Charm_Ave"], ascending=False)

In [None]:
# the original df remains "unsorted"
df
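
When sorting by several columns, `ascending` can also be a list with one flag per column. The sketch below uses made-up averages with a tie in `Potion_Ave` to show the second column acting as a tie-breaker.

```python
import pandas as pd

# Invented averages; Student A and Student C tie on Potion_Ave
demo = pd.DataFrame(
    {"Potion_Ave": [90.0, 85.0, 90.0], "Charm_Ave": [70.0, 95.0, 80.0]},
    index=["Student A", "Student B", "Student C"],
)

# Potion_Ave descending, then Charm_Ave ascending within ties
ranked = demo.sort_values(by=["Potion_Ave", "Charm_Ave"], ascending=[False, True])

print(ranked)
```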

### 2.4. Copy vs. In-Place Sorting
* `sort_index` and `sort_values` return a _copy_ of the original `DataFrame`
* Making a copy could require substantial memory in a big data application
* Can sort _in place_ by passing the keyword argument `inplace=True` 

In [None]:
df.sort_values(by=["Potion_Ave", "Charm_Ave"], ascending=False, inplace=True)

In [None]:
df
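
One detail that is easy to trip over: with `inplace=True` the method returns `None`, so the result should not be assigned back to the variable. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up averages, for illustration only
df_demo = pd.DataFrame(
    {"Potion_Ave": [70.0, 90.0]},
    index=["Student A", "Student B"],
)

# With inplace=True the method returns None and reorders df_demo itself
result = df_demo.sort_values(by="Potion_Ave", ascending=False, inplace=True)

print(result)
print(list(df_demo.index))
```

Writing `df_demo = df_demo.sort_values(..., inplace=True)` would therefore replace the `DataFrame` with `None`.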

<a id="3"></a>

## 3. Querying a DataFrame

### 3.1. Boolean Indexing
* One of pandas’ more powerful selection capabilities is **Boolean indexing**
* Comparing a column with a value returns a Boolean `Series` with one `True`/`False` per row; here we test which `Potion_Ave` grades are greater than or equal to 85:

In [None]:
df["Potion_Ave"] >= 85

In [None]:
type(df["Potion_Ave"] >= 85)

### 3.1.1. where() method
* Checks a `DataFrame` against one or more conditions and returns a result of the same shape.
* By default, the rows not satisfying the condition are filled with `NaN`.
* `df.where(cond, other=nan, inplace=False)`

In [None]:
df85 = df.where(df["Potion_Ave"] >= 85)

In [None]:
df85

* Pandas checks each row's `Potion_Ave` to determine whether it is greater than or equal to 85 and, if so, keeps that row's values in the new `DataFrame`
* Rows for which the condition is `False` have all of their values replaced with **`NaN` (not a number)** in the new `DataFrame`
* `NaN` is pandas’ notation for missing values
* Can substitute a different value for `NaN` using the **other** keyword argument
* where() returns a new DataFrame
    * If you want the changes to be applied to the original df instead, set **inplace=True**
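
A minimal sketch of `where` with and without `other`, on made-up averages. The replacement value `0` is an arbitrary choice for illustration.

```python
import pandas as pd

# Invented averages; only Student A meets the Potion_Ave threshold
demo = pd.DataFrame(
    {"Potion_Ave": [90.0, 70.0], "Charm_Ave": [60.0, 88.0]},
    index=["Student A", "Student B"],
)

cond = demo["Potion_Ave"] >= 85

masked = demo.where(cond)            # failing rows become NaN in every column
filled = demo.where(cond, other=0)   # failing rows become 0 instead

print(masked)
print(filled)
```

The result keeps the shape of the original in both cases; only the values in the failing rows change.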

### 3.1.2. dropna() method
* Drops rows that contain NaN values from a DataFrame (by default, any row with at least one NaN)

In [None]:
df85 = df85.dropna()

In [None]:
df85

* We don't have to use where() all the time: indexing with the Boolean mask directly keeps only the rows where the condition is `True`

In [None]:
df85 = df[df["Potion_Ave"] >= 85]

In [None]:
df85

In [None]:
df85 = df[df["Charm_Ave"] >= 85]

In [None]:
df85

#### Let's return a list of students whose average score is 85 or higher in **both** Potions and Charms
* Pandas Boolean indices combine multiple conditions with the Python operator `&` (bitwise AND), _not_ the `and` Boolean operator 
* For `or` conditions, use `|` (bitwise OR)
* Each Boolean mask needs to be enclosed in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>=`

In [None]:
df85 = df[(df["Potion_Ave"] >= 85) & (df["Charm_Ave"] >= 85)]

In [None]:
df85
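
To see why `&` and `|` are needed, the sketch below (made-up averages) combines two masks and also shows that Python's `and` raises a `ValueError`, because a whole `Series` has no single truth value.

```python
import pandas as pd

# Invented averages for three students
demo = pd.DataFrame(
    {"Potion_Ave": [90.0, 70.0, 88.0], "Charm_Ave": [86.0, 95.0, 60.0]},
    index=["Student A", "Student B", "Student C"],
)

potion_ok = demo["Potion_Ave"] >= 85
charm_ok = demo["Charm_Ave"] >= 85

both = demo[potion_ok & charm_ok]     # element-wise AND
either = demo[potion_ok | charm_ok]   # element-wise OR

# `and` tries to convert each Series to one bool, which pandas refuses
try:
    potion_ok and charm_ok
    and_failed = False
except ValueError:
    and_failed = True

print(list(both.index))
print(list(either.index))
print(and_failed)
```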

#### Let's return a list of students whose average score is 85 or higher in either Potions or Charms

In [None]:
df85 = df[(df["Potion_Ave"] >= 85) | (df["Charm_Ave"] >= 85)]

In [None]:
df85

#### How would you modify the previous code snippet to retrieve a dataframe of students scoring greater than, or equal to, class average in both classes?

In [None]:
df["Potion_Ave"].mean()

In [None]:
df["Charm_Ave"].mean()

* Put these into the previous combined Boolean expression:
```python
df85 = df[(df["Potion_Ave"] >= 85) & (df["Charm_Ave"] >= 85)]
```

* Replace 85 with `df["Charm_Ave"].mean()` and `df["Potion_Ave"].mean()`

In [None]:
new_df = df[(df["Potion_Ave"] >= df["Potion_Ave"].mean()) & (df["Charm_Ave"] >= df["Charm_Ave"].mean())]

In [None]:
new_df

* **Retrieve the number of students satisfying the previous criteria:**

In [None]:
len(new_df)  # number of students

* **Retrieve a list of student names satisfying the previous criteria:**

In [None]:
list(new_df.index)

* Alternatively, if you want to do everything in one line:

In [None]:
overachievers = list(df[(df["Potion_Ave"] >= df["Potion_Ave"].mean()) & (df["Charm_Ave"] >= df["Charm_Ave"].mean())].index)

In [None]:
overachievers

<center><img src="res/df.png" /></center>

<a id="4"></a>


## 4. Data Cleaning Example
* No need to store the columns you are not going to use during analysis
* Can select a subset of columns and create a new DF containing those columns
* Then write this DF to a file

In [None]:
df

In [None]:
columns_needed = ["House", "Potion_Ave", "Charm_Ave"]

In [None]:
final_df = df[columns_needed]

In [None]:
final_df

In [None]:
final_df.to_csv("res/ave_grades.csv")
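
By default `to_csv` writes the index as the first column, so it has to be named again when reading the file back. The sketch below uses made-up data and an in-memory string instead of a file on disk (an assumption made so that the example does not depend on the `res/` folder).

```python
import io

import pandas as pd

# Invented table with a named index, mimicking the Student index
demo = pd.DataFrame(
    {"House": ["H1", "H2"], "Potion_Ave": [87.5, 72.0]},
    index=pd.Index(["Student A", "Student B"], name="Student"),
)

csv_text = demo.to_csv()   # no path given, so the CSV is returned as a string

# Restore the index by pointing index_col at the written index column
restored = pd.read_csv(io.StringIO(csv_text), index_col="Student")

print(csv_text)
print(restored)
```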

<a id="6"></a>