**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 15. Creating New Variables in Pandas

## In this lesson...

- More ways to create new variables in Pandas

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## A tiny example dataset

* Let's start by importing Pandas:

In [None]:
import pandas as pd

* For this lesson, we'll use the example dataset in `data/sample.csv` to illustrate how the different ways of creating new variables work

In [None]:
df = pd.read_csv('data/sample.csv')
df

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## The assign method

* In a previous lesson, we learned that we can create a new variable/column in a DataFrame by *direct assignment*, like this:

In [None]:
df['e'] = df['a'] + df['b']
df.head()

* Note that the direct assignment approach makes the changes *in place*; that is, `df` itself is modified

- Here's another way to create a new variable/column in a DataFrame


- Let's read in the data again and start from scratch:

In [None]:
df = pd.read_csv('data/sample.csv')

- First, we create a function that takes a DataFrame as input, and outputs the sum of columns `a` and `b` of the DataFrame, like this:
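A minimal sketch of such a function, using the name `add_a_and_b()` that this lesson refers to later. The small `sample` frame is a made-up stand-in, since the contents of `data/sample.csv` are not shown here.

```python
import pandas as pd

def add_a_and_b(df):
    # Takes a DataFrame, returns a Series: the row-by-row sum of columns a and b
    return df['a'] + df['b']

# Made-up stand-in data (not the real data/sample.csv)
sample = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
result = add_a_and_b(sample)
print(result)
```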

- Second, we use the `.assign()` DataFrame method to create the new variable
    - Each keyword argument `key=function` creates a new variable called `key` equal to the output of `function(df)`

- So, to create a new variable `e` that is equal to the sum of columns `a` and `b`, we can do this:
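Putting the two steps together might look like the sketch below. The frame is a made-up stand-in for `data/sample.csv`, and `df_with_e` is just an illustrative name.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def add_a_and_b(df):
    return df['a'] + df['b']

# e=add_a_and_b means: new column e = add_a_and_b(df)
df_with_e = df.assign(e=add_a_and_b)
print(df_with_e)
```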

* Note that `.assign()` *returns a new DataFrame*
    - It does not make the changes in place, like the direct assignment approach above

* If we inspect the contents of `df` now, we see that it doesn't contain a column named `e`:

In [None]:
df

* Now... creating a separate function to perform the operation we want is a bit cumbersome


* Instead, we can use a **lambda function**, like this:
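As a sketch, here is the same computation with the function written inline. The data is again a made-up stand-in for `data/sample.csv`.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# The lambda receives the DataFrame as x and returns the new column's values
df_with_e = df.assign(e=lambda x: x['a'] + x['b'])
print(df_with_e)
```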

- The lambda function

    ```python
    lambda x: x['a'] + x['b']
    ```

    <br>takes `x` as input and outputs `x['a'] + x['b']`, which is exactly what the function `add_a_and_b()` does


- We can even create multiple variables in a single call of `.assign()`, like this:
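One possible illustration, with made-up stand-in data and a hypothetical second column name `e_doubled`. The keyword arguments are evaluated in the order they are written, so the second lambda can use the column created by the first.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

result = df.assign(
    e=lambda x: x['a'] + x['b'],
    e_doubled=lambda x: 2 * x['e'],  # e already exists when this lambda runs
)
print(result)
```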

- In the example above, note that we can even refer to *newly created* columns within the same call of `.assign()`

- **Template for using the `.assign()` method**

    ```python
    df.assign(
        new_variable_1=lambda x: ...,
        new_variable_2=lambda x: ...
    )
    ```

    <br>where `x` refers to the DataFrame `df`

### Why use the assign method?

- The direct assignment approach makes changes to the DataFrame in place, while `.assign()` returns a new DataFrame


- As a result, the `.assign()` method supports method chaining, like this:

In [None]:
(
    df
    .assign(e=lambda x: x['c'] / 2)
    .query('e > 5')
)

- *Refactoring* &mdash; in particular, renaming variables &mdash; is easier as well 


- As a comparison, think about changing the name of the DataFrame from `df` to `another_df` in these 2 cases:

    ```python
    # Case 1
    df['f'] = df['a'] + df['b'] + df['c'] + df['d']
    
    # Case 2
    df.assign(
        f=lambda x: x['a'] + x['b'] + x['c'] + x['d']
    )
    ```

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Useful Series and DataFrame methods for computations

- Pandas has a myriad of built-in methods for various computations


- In this lesson, we'll learn about some of the basic ones


- To find the documentation for the Series and DataFrame methods covered in this section, [search the documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) for either
    - `pandas.Series.method_name` or
    - `pandas.DataFrame.method_name`

### Element-wise methods

* **Element-wise methods** perform the same computation on each element of a Series or DataFrame


* One such example is the `.round()` Series/DataFrame method


* When applied to a Series, we get the values in the Series, rounded to the given number of decimals


* For example:
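A small sketch on a made-up Series (the values are illustrative, not taken from `data/sample.csv`):

```python
import pandas as pd

# Made-up values for illustration
values = pd.Series([1.2345, 2.7182, 3.1415])

# Round every element to 2 decimal places
rounded = values.round(2)
print(rounded)
```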

* We can add a column to our DataFrame with these rounded values, like this:
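For instance, assuming we want to round column `d` (an arbitrary choice for this sketch) in a made-up stand-in frame, with `d_rounded` as a hypothetical column name:

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'd': [1.2345, 2.7182, 3.1415]})

# Add a column holding d rounded to 1 decimal place
df_rounded = df.assign(d_rounded=lambda x: x['d'].round(1))
print(df_rounded)
```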

* We can also apply `.round()` to the entire DataFrame:
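A sketch on made-up stand-in data. Every numeric column is rounded, and integer columns are left as they are.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'd': [1.2345, 2.7182, 3.1415]})

# Round every numeric column to 1 decimal place
whole = df.round(1)
print(whole)
```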

- A few mathematical element-wise methods in Pandas:

| Method | Description |
| :- | :- |
| `.round()` | Round values to given number of decimal places | 
| `.abs()` | Absolute values |


- For more complex element-wise mathematical computations, you can use the appropriate [NumPy](https://numpy.org/) functions


- For example, we can compute the cosine of the values in column `b` like this:
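One way this might look, with made-up values for column `b` and a hypothetical new column name `cos_b`. NumPy functions such as `np.cos()` accept a Series and return a Series with the same index.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'b': [0.0, 3.0, 6.0]})

# Element-wise cosine of column b
cos_b = np.cos(df['b'])

# The same computation, added as a new column
df_with_cos = df.assign(cos_b=lambda x: np.cos(x['b']))
print(df_with_cos)
```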

- [Here's a list of NumPy mathematical functions](https://numpy.org/doc/stable/reference/routines.math.html)

❓ __Exercise 1.__ Add a column to `df` called `log10_c`, which contains the base 10 logarithm of the values in column `c`.

*Hint.* Take a look at the list of NumPy mathematical functions linked above.

### Reduction methods

- **Reduction methods** either
    - map a Series to a single value, or
    - map the rows or columns of a DataFrame to a Series

- One such example is the `.sum()` Series/DataFrame method


- When applied to a Series, we get the sum of the values in the Series:
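A short sketch with made-up stand-in data:

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Sum of the values in column a: a single number
total_a = df['a'].sum()
print(total_a)
```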

- By default, when applied to a DataFrame, `.sum()` returns a Series containing sums *across the rows*; that is, one sum for each column:
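With made-up stand-in data, the default behavior can be sketched as follows. The result is a Series indexed by column name.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# One sum per column
column_sums = df.sum()
print(column_sums)
```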

- To sum *across the columns* instead, we can use the `axis='columns'` keyword:
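The same made-up frame, summed the other way. The result now has one value per row and shares the DataFrame's row index.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# One sum per row
row_sums = df.sum(axis='columns')
print(row_sums)
```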

- We can select a subset of columns first, and then sum across the columns, like this:
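A sketch that sums only columns `a` and `b` (an arbitrary choice for illustration) in a made-up stand-in frame:

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [10, 15, 20]})

# Select columns a and b, then sum across the columns within each row
a_plus_b = df[['a', 'b']].sum(axis='columns')
print(a_plus_b)
```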

* We can add a column to our DataFrame with this sum, like this:
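Combining this with `.assign()` might look like the following, again on made-up data and with a hypothetical column name `a_plus_b`:

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [10, 15, 20]})

# New column: the sum of a and b within each row
df_with_sum = df.assign(
    a_plus_b=lambda x: x[['a', 'b']].sum(axis='columns')
)
print(df_with_sum)
```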

*Food for thought.* What happens when you use the code above, but with `axis='rows'` instead? Why does this happen?

- Here are some useful reduction methods:

    
| Method | Description |
| :- | :- |
| `.count()` | Number of non-NA values |
| `.min()`, `.max()` | Compute minimum and maximum values |
| `.quantile()` | Sample quantile at a given level between 0 and 1 |
| `.sum()` | Sum of values |
| `.mean()` | Mean of values |
| `.median()` | Median of values |
| `.mad()` | Mean absolute deviation from mean value (removed in pandas 2.0; for a Series `s`, use `(s - s.mean()).abs().mean()` instead) |
| `.prod()` | Product of all values |
| `.var()` | Sample variance of values |
| `.std()` | Sample standard deviation of values |


- All the methods above use the `axis=...` keyword argument in the same way

❓ __Exercise 2.__ Add two new columns to `df` called `min_value` and `max_value`, containing the minimum and maximum value in each row, respectively.

### "Same size" methods

* There are other methods that
    - take a row or column of values
    - perform computations on these values *as a group* (not element-wise)
    - return a row or column of values *of the same size*

* For example, we can obtain the cumulative sum of values across the *columns* with the `.cumsum()` method, like this:
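A sketch on made-up stand-in data. Within each row, every column holds the running total of that column and all columns to its left.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [10, 15, 20]})

# Running total from left to right within each row
across_columns = df.cumsum(axis='columns')
print(across_columns)
```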

* We can also obtain the cumulative sum of values across the *rows* of a column with the `.cumsum()` method, like this:
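For a single column, shown here with made-up values for column `c`, the result is a Series of the same length as the original:

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'c': [10, 15, 20]})

# Running total down the rows of column c
c_running = df['c'].cumsum()
print(c_running)
```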

* As we have done above, we can add this cumulative sum of values for column `c` as a new column of our DataFrame, like this:
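One possible version, using made-up data and a hypothetical column name `c_cumsum`:

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3], 'c': [10, 15, 20]})

# New column: running total of column c
df_with_cumsum = df.assign(c_cumsum=lambda x: x['c'].cumsum())
print(df_with_cumsum)
```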

* Here are some methods that behave similarly &mdash; that is, they
    - take a row or column of values
    - perform a computation on these values as a group
    - then return a row or column of values of the same size

| Method | Description |
| :- | :- |
| `.cumsum()` | Cumulative sum of values |
| `.cummin()`, `.cummax()` | Cumulative minimum and maximum of values |
| `.cumprod()` | Cumulative product of values |
| `.diff()` | Compute difference between consecutive rows/columns |
| `.pct_change()` | Compute percent changes between consecutive rows/columns |
| `.rank()` | Compute numerical ranks | 

* Some of these methods take optional keyword arguments that change what they compute 
    * Take a look at the documentation for details

❓ __Exercise 3.__ Add a column called `a_diff` to `df`, containing the difference between consecutive values in column `a`. 

You should see that the first element of your newly created column `a_diff` has a missing value. Does this make sense?

*Write your answer here. Double-click to edit.*

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Other useful ways to create new variables

### Adding a constant column

* Sometimes it's useful to add a column in which every value is equal to a constant, either a string or numeric value


* In this case, you don't need a lambda function; you can pass the constant value directly to `.assign()`, like this:
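A sketch with made-up data. The column names `source` and `weight` and their values are hypothetical, chosen only to show one string constant and one numeric constant.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'a': [1, 2, 3]})

# A plain value (not a function) is repeated in every row
df_const = df.assign(source='sample', weight=1.0)
print(df_const)
```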

### Mapping values in one column to another

* Suppose we know that the values in column `c` in `df` range between 10 and 20, and we want to create a column with those values in words

* We can create a dictionary with our desired mapping, like this:

In [None]:
number_in_words = {
    10: 'Ten',
    11: 'Eleven',
    12: 'Twelve',
    13: 'Thirteen',
    14: 'Fourteen',
    15: 'Fifteen',
    16: 'Sixteen',
    17: 'Seventeen',
    18: 'Eighteen',
    19: 'Nineteen',
    20: 'Twenty'
}

* Then, we can use the `.map()` method to create our new column, like this:
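A self-contained sketch, using made-up values for column `c`, a shortened copy of the dictionary, and a hypothetical column name `c_in_words`. Any value of `c` that is not a key in the dictionary would become a missing value (`NaN`).

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'c': [10, 15, 20]})

# Shortened copy of the lesson's dictionary, covering only the values above
number_in_words = {10: 'Ten', 15: 'Fifteen', 20: 'Twenty'}

# Look up each value of c in the dictionary
df_words = df.assign(c_in_words=lambda x: x['c'].map(number_in_words))
print(df_words)
```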

* We can also pass a _function_ to `.map()` to accomplish something similar!

* For example, suppose we want to create a new column based on column `b` in `df`:
    * If the value in `b` is below 6, we want the value in our new column to be `low`
    * Otherwise, we want the value in our new column to be `high`

    
* Let's first define a function that creates this mapping:

In [None]:
def low_or_high(value):
    if value < 6:
        return 'low'
    else:
        return 'high'

* Now we can use the `.map()` method to create our new column, like this:
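A self-contained sketch with made-up values for column `b` and a hypothetical column name `b_level`. Here `.map()` calls the function once for each value in the column.

```python
import pandas as pd

# Made-up stand-in data (not the real data/sample.csv)
df = pd.DataFrame({'b': [4, 6, 8]})

def low_or_high(value):
    if value < 6:
        return 'low'
    else:
        return 'high'

# Apply low_or_high to each value in column b
df_levels = df.assign(b_level=lambda x: x['b'].map(low_or_high))
print(df_levels)
```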

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

### Problem 0

In the `data` folder next to this notebook, there is a CSV file `grades.csv` (that is, `data/grades.csv`). This is the same dataset used for the problems in Lesson 12. Read the CSV file into a DataFrame. Display the top 5 rows of the DataFrame.

### Problem 1

Find the minimum, maximum, and median quiz and exam grades. Write your code so that the student IDs and names are *not* included in the output.

*Hint.* Use the `.drop()` DataFrame method.

### Problem 2

Create a new DataFrame based on the one you created in Problem 0:

- Compute each student's quiz grades as a percentage (fraction between 0 and 1). The maximum score on Quizzes 1 and 3 is 20 points; the maximum score on Quizzes 2 and 4 is 30 points. Add 4 columns, one for each quiz as a percentage.


- Compute each student's exam grades as a percentage. The maximum score on each exam is 100 points. Add 2 columns, one for each exam as a percentage.


- For each student, compute the average of their quiz percentage grades. Add a column containing the quiz averages.


- For each student, compute the average of their exam percentage grades. Add a column containing the exam averages.


- Compute each student's course grade as a weighted average of each student's quiz average and exam average: quizzes are worth 35\%, exams are worth 65\%. Add a column containing the course grade.


Do this using a *single* call to `.assign()`, and use as many of the Series/DataFrame methods described above (e.g. `.sum()`, `.mean()`, etc.) as required. 

Check your work by displaying the top 5 rows of the DataFrame. You should find that Romeo Conway has a course grade of about 78.8%, Naseem Livingston has a course grade of 70.4%, and Remy Clark has a course grade of 77.5%.

### Problem 3

Create a new DataFrame based on the one you created in Problem 2. In particular, add a new column containing the rank of each student based on their course grade. The student with the highest course grade should have rank 1. Your new DataFrame should *only* contain the student ID, last name, first name, course grade, and rank. The first 5 rows of your new DataFrame should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>student_id</th>
      <th>lastname</th>
      <th>firstname</th>
      <th>course_grade</th>
      <th>rank</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2603</td>
      <td>Conway</td>
      <td>Romeo</td>
      <td>0.787958</td>
      <td>8.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>6435</td>
      <td>Livingston</td>
      <td>Naseem</td>
      <td>0.703708</td>
      <td>30.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>6754</td>
      <td>Clark</td>
      <td>Remy</td>
      <td>0.775000</td>
      <td>10.0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>3032</td>
      <td>Carpenter</td>
      <td>Zubair</td>
      <td>0.754000</td>
      <td>14.0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>2715</td>
      <td>Guerra</td>
      <td>Samantha</td>
      <td>0.731167</td>
      <td>19.0</td>
    </tr>
  </tbody>
</table>

*Hint.* Read the documentation for `.rank()`.