# CMPINF 2100: Homework 6

## Instructions

### Assignment Overview

The DataFrame is a table-like object that is critically important in data analysis. It allows us to easily store many variables associated with an application and explore them. We must learn about the key attributes and methods associated with DataFrames before we can use them to explore and ultimately model data.

This assignment is mostly focused on Pandas. You will work with the major attributes associated with Pandas Series and DataFrames. You will slice/subset Series and DataFrames to practice manipulating data.

However, you will begin by reviewing important attributes associated with NumPy arrays. That is because the Pandas DataFrame is built on top of NumPy. You will practice reshaping 1D NumPy arrays into 2D NumPy arrays to review the differences between rows and columns.
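As a quick refresher, here is a minimal sketch of the array attributes and the row/column distinction. The `demo_1d` and `demo_2d` names are hypothetical and are not part of the assignment.

```python
import numpy as np

# Hypothetical 1D array of 6 elements (not one of the assignment's arrays)
demo_1d = np.arange(1, 7)

# Reshape into 2 rows x 3 columns; values fill row by row
demo_2d = demo_1d.reshape(2, 3)

print(demo_1d.ndim, demo_1d.shape, demo_1d.size)  # 1 (6,) 6
print(demo_2d.ndim, demo_2d.shape, demo_2d.size)  # 2 (2, 3) 6

print(demo_2d[0])     # first row
print(demo_2d[:, 0])  # first column
```

The first axis of a 2D array indexes rows and the second indexes columns, which is the same convention a DataFrame follows.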

As with previous notebooks, you are free to add any additional code or markdown cells as needed.


### Important (please review)

1. __Please ignore any empty cells you encounter in this notebook.__  


2.  __For any cell that contains the line `raise NotImplementedError` please remove the line `raise NotImplementedError` and replace that with your solution code.__  


#### Empty Cells

You will see a number of empty cells throughout this notebook.  These cells appear after cells that will be auto-graded upon assignment submission.  Please do not attempt to remove or modify these cells as that will impede the auto-grading process.  You can just ignore these cells when you encounter them.

   

#### NotImplementedError Statements

You will see a number of `raise NotImplementedError` statements throughout the notebook. These statements are placeholders the auto-grader uses to mark the places where you will add your code, and that code is what will be auto-graded.  Please make sure to completely *remove* any line that matches the following: `raise NotImplementedError` and replace that with your code solution.  Failure to remove a `raise NotImplementedError` line will cause the auto-grader to fail even if you provided your solution for that given question.

## Setup and Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
x1D = np.arange(1, 31)      # 1D NumPy array holding the integers 1 through 30
x2Da = x1D.reshape(6, 5)    # 2D NumPy array of shape (6, 5)
x2Db = x1D.reshape(10, -1)  # 2D NumPy array of shape (10, 3); -1 tells NumPy to infer the missing dimension
print(x1D)
print(x2Da)
print(x2Db)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30]
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]
 [21 22 23 24 25]
 [26 27 28 29 30]]
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]
 [13 14 15]
 [16 17 18]
 [19 20 21]
 [22 23 24]
 [25 26 27]
 [28 29 30]]


## Problem 1

Let's now practice working with Pandas Series.

### 1a)

Convert the LAST **row** in `x2Da` to a Pandas Series and assign the result to the `series_2a` object. You may use the default `index` argument when creating the Pandas Series.

#### 1a) - SOLUTION

In [3]:
series_2a = pd.Series(x2Da[5])
series_2a

0    26
1    27
2    28
3    29
4    30
dtype: int32

### 1b)

Print the LAST row in `x2Da` NumPy array to the screen. Then print the `series_2a` object to the screen.

Describe how the print displays look different within a Markdown cell.

#### 1b) - SOLUTION

In [4]:
print(x2Da[-1])
print(series_2a)

[26 27 28 29 30]
0    26
1    27
2    28
3    29
4    30
dtype: int32


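The last row of the NumPy array `x2Da` prints horizontally on a single line inside square brackets, while the Pandas Series prints vertically, one element per line, with the index labels on the left and the `dtype` at the bottom.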

### 1c)

Print the `.index` attribute for the `series_2a` object to the screen.

Do you see the `.index` attribute displayed in the `series_2a` printout shown previously in 1b)? Type your response in a Markdown cell.

#### 1c) - SOLUTION

In [5]:
series_2a.index

RangeIndex(start=0, stop=5, step=1)

Yes, the index values (0 through 4) appear in the left-hand column of the `series_2a` printout in 1b).

### 1d)

We can modify the `.index` attribute of a Series by assigning a new value to it. In this problem, you must create a list of `str`'s. That list must contain a sequential list of lower case English alphabetical letters starting with `'a'`. The list must be the same length as the length of the `series_2a` object.

For example, if `series_2a` consisted of 3 elements, the list should contain the letters `'a'`, `'b'`, and `'c'`.

Assign the list of letters to the `my_letter_index` object.

Assign the `my_letter_index` to the `.index` attribute of `series_2a`.

Print the `series_2a` object to the screen.
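Typing the letters by hand works for a short Series. One way to generate them instead is sketched here, assuming the standard library's `string.ascii_lowercase`. The `demo_series` and `demo_letters` names are hypothetical.

```python
import string

import pandas as pd

# Hypothetical 3-element Series with the default RangeIndex
demo_series = pd.Series([10, 20, 30])

# Take as many lowercase letters as there are elements in the Series
demo_letters = list(string.ascii_lowercase[:len(demo_series)])

# Assigning a list of the same length replaces the index labels
demo_series.index = demo_letters

print(demo_letters)  # ['a', 'b', 'c']
print(demo_series)
```

The new index must have exactly as many labels as the Series has elements, otherwise Pandas raises an error.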

#### 1d) - SOLUTION

In [6]:
my_letter_index = ["a", "b", "c", "d", "e"]
series_2a.index = my_letter_index
series_2a


a    26
b    27
c    28
d    29
e    30
dtype: int32

### 1e)

Let's now subset `series_2a` using the `.index`. Display the value associated with the `.index` location key `'c'` to the screen.

#### 1e) - SOLUTION

In [7]:
series_2a["c"]

28

### 1f)

Display the value associated with the last `.index` location key to the screen.
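When the index holds string labels, it helps to be explicit about whether you mean a label or a position. Recent Pandas versions warn when plain brackets receive an integer position on such a Series. The sketch below uses a hypothetical `demo_series` to show both explicit forms.

```python
import pandas as pd

# Hypothetical Series with string labels
demo_series = pd.Series([10, 20, 30], index=["x", "y", "z"])

# Position-based: -1 means "last element", regardless of the labels
last_by_position = demo_series.iloc[-1]

# Label-based: look up the last label first, then select by that label
last_label = demo_series.index[-1]
last_by_label = demo_series.loc[last_label]

print(last_label, last_by_position, last_by_label)
```

Both forms return the same value here; `.loc` makes it clear that the `.index` key is what drives the selection.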

#### 1f) - SOLUTION

In [8]:
series_2a.iloc[-1]  # explicit position-based access; plain series_2a[-1] triggers a warning when the index holds string labels

30

### 1g)

Reset the `.index` attribute of `series_2a` to the original range index.

Display the `.index` attribute to the screen to confirm it has been changed back.

#### 1g) - SOLUTION

In [10]:
series_2a = series_2a.reset_index(drop=True)
series_2a.index

RangeIndex(start=0, stop=5, step=1)

## Problem 2

The Pandas Series builds on top of the 1D NumPy array. The Pandas DataFrame builds on top of the 2D NumPy array. The DataFrame is a critical data type in many data analysis applications. We will use it extensively for the remainder of CMPINF 2100.

Let's now practice creating DataFrames!
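The relationship between the two libraries can be seen with a minimal sketch. The `demo_array`, `demo_df`, and `demo_col` names are hypothetical and are not the assignment's objects.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 2 array
demo_array = np.arange(6).reshape(3, 2)

# Without an explicit columns argument, the columns are labeled 0, 1, ...
demo_df = pd.DataFrame(demo_array)

# Selecting one column by its default label returns a Series
demo_col = demo_df[0]

print(demo_df.to_numpy().ndim)   # the DataFrame's values form a 2D array
print(demo_col.to_numpy().ndim)  # the Series' values form a 1D array
print(demo_df.columns)
```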

### 2a)

Convert the 2D NumPy array `x2Db` to a Pandas DataFrame. Assign the result to the `df1` object. You may use the default `index` and `columns` arguments when creating the DataFrame.

#### 2a) - SOLUTION

In [None]:
df1 = pd.DataFrame(x2Db)  # default integer index and column labels

### 2b)

Use the `.info()` method to display basic information associated with `df1`.

What are the names of the columns in `df1` and their data types? Type your response in a Markdown cell.

#### 2b) - SOLUTION

In [None]:
# type your code here

### 2c)

Let's convert the `x2Db` NumPy array to a DataFrame again. This time you will specify the column names rather than relying on the default names.

Let's begin by defining a List that stores the desired columns.

Create a list of sequential upper case English alphabetical letters as `str` data types. The list must start with `'A'`. The length of the list must be the same as the number of columns in `x2Db`.

Assign the list to the `my_column_names` object.

#### 2c) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 2d)

Convert the 2D NumPy array `x2Db` to a Pandas DataFrame. Assign the result to the `df2` object. You may use the default `index` argument. You must specify the `columns` argument such that the values in `my_column_names` are assigned as the column names.
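As an illustration of the `columns` argument, here is a minimal sketch on a hypothetical array. The `demo_array`, `demo_names`, and `demo_df` names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical 4 x 2 array
demo_array = np.arange(1, 9).reshape(4, 2)

# One name per column: the list length must equal demo_array.shape[1]
demo_names = ["left", "right"]

demo_df = pd.DataFrame(demo_array, columns=demo_names)

# .info() prints the column names, non-null counts, and data types
demo_df.info()
```

Only the column labels change; the values and their data types are the same as with the default labels.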

#### 2d) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 2e)

Use the `.info()` method to display basic information associated with `df2`.

What are the names of the columns in `df2` and their data types? Are they different from those in `df1`? Type your response in a Markdown cell.

#### 2e) - SOLUTION

In [None]:
# type your code here

## Problem 3

Let's now practice selecting columns and rows within DataFrames! You will work with the `df2` object defined in the prior problem.

### 3a)

Let's begin by displaying all rows associated with the `'B'` column to the screen.

There are several ways to accomplish this. You **must** use **2 different** ways of selecting the `'B'` column. The result can be either a Pandas Series object or DataFrame for this problem.
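A few of the common column-selection forms are sketched here on a hypothetical `demo_df`. Note which ones return a Series and which return a DataFrame.

```python
import pandas as pd

# Hypothetical DataFrame with two columns
demo_df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

as_series = demo_df["y"]            # single brackets -> Series
also_series = demo_df.loc[:, "y"]   # all rows, one column label -> Series
as_frame = demo_df[["y"]]           # list of names -> one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```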

#### 3a) - SOLUTION

In [None]:
# type your code here

### 3b)

Next, display all rows associated with the `'A'` and `'C'` columns to the screen.

You must accomplish this procedure by using the column **names** and NOT the column position index. You only need to do this once.

#### 3b) - SOLUTION

In [None]:
# type your code here

### 3c)

Next, display **all columns** associated with the LAST **row** within `df2`.

You must use the appropriate attribute that lets you access the rows and columns based on the **integer position index**.
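For reference, a minimal sketch of integer-position selection on a hypothetical `demo_df`:

```python
import pandas as pd

# Hypothetical DataFrame with three rows
demo_df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# A single integer position returns that row as a Series
last_row = demo_df.iloc[-1]

# A list of positions keeps the result as a DataFrame
last_row_frame = demo_df.iloc[[-1]]

print(last_row)
print(last_row_frame)
```

Negative positions count from the end, just as they do for Python lists and NumPy arrays.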

#### 3c) - SOLUTION

In [None]:
# type your code here

### 3d)

Let's modify `df2` now by sorting the rows based on the `'A'` column.

Sort `df2` such that the rows are in DESCENDING order based on the `'A'` column. You **must** modify `df2` in place. You **must** ignore the index.
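The sorting options can be sketched on a hypothetical `demo_df`, assuming the `ascending`, `inplace`, and `ignore_index` parameters of `sort_values`:

```python
import pandas as pd

# Hypothetical DataFrame whose rows start out unsorted
demo_df = pd.DataFrame({"x": [2, 3, 1], "y": ["b", "c", "a"]})

# inplace=True modifies demo_df itself (the call returns None);
# ignore_index=True relabels the sorted rows 0, 1, 2, ...
demo_df.sort_values(by="x", ascending=False, inplace=True, ignore_index=True)

print(demo_df)
```

Without `ignore_index=True` each row would keep its original index label after sorting, so the labels would no longer be in order.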

#### 3d) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 3e)

Lastly, display all columns associated with the first (zeroth) row within `df2`.

You must use the appropriate attribute that lets you access the rows and columns based on the **integer position index**.

#### 3e) - SOLUTION

In [None]:
# type your code here

### 3f)

Are the displayed results from 3e) different from those displayed in 3c)?

Why or why not? Type your response in a Markdown cell.

#### 3f) - SOLUTION

In [None]:
# type your response here

## Problem 4

This problem will give you further practice working with DataFrames. You will use data associated with games from the Pittsburgh Pirates' 2022 season. The data are entered within a Dictionary for you below.

In [None]:
pirates_dict = {'Month': 12 * ['April'],
                'Day': [7, 9, 10] + list( range(12, 21) ),
                'Away_Team': 3 * ['Pirates'] + 2 * ['Cubs'] + 4 * ['Nationals'] + 3 * ['Pirates'],
                'Home_Team': 3 * ['Cardinals'] + 2 * ['Pirates'] + 4 * ['Pirates'] + 3 * ['Brewers'],
                'Away_Score': [0, 2, 9, 2, 2, 4, 7, 4, 3, 1, 2, 2],
                'Home_Score': [9, 6, 4, 1, 6, 9, 2, 6, 5, 6, 5, 4],
                'Winning_Team': 2 * ['Cardinals'] + ['Pirates'] + ['Cubs'] + 2 * ['Pirates'] + ['Nationals'] + 2 * ['Pirates'] + 3 * ['Brewers'],
                'Winning_Pitcher': ['Wainwright', 'Whitley', 'Yajure', 'Smyly', 'Peters', 'Contreras', 'Fedde', 'Peters', 'Hembree', 'Lauer', 'Burnes', 'Woodruff'],
                'Losing_Pitcher': ['Brubaker', 'Keller', 'Matz', 'Quintana', 'Hendricks', 'Adon', 'Keller', 'Rogers', 'Cishek', 'Thompson', 'Brubaker', 'Keller']}

### 4a)

How many KEY/VALUE pairs or ITEMS are contained within the `pirates_dict`?  Save the value associated with `len(pirates_dict)` to the variable `dlen`.

#### 4a) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 4b)

Convert the `pirates_dict` to a Pandas DataFrame and assign the result to the `pirates_df` object. You may use the default `index` and `columns` arguments when creating the DataFrame.

Use the appropriate attribute to display the number of rows and columns associated with `pirates_df` to the screen.
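As a small illustration, here is a sketch using a hypothetical `demo_dict`. Each key becomes a column name and each list becomes that column's values.

```python
import pandas as pd

# Hypothetical dictionary: 2 keys, each holding a list of 3 values
demo_dict = {"city": ["A", "B", "C"], "visits": [3, 1, 2]}

demo_df = pd.DataFrame(demo_dict)

print(len(demo_dict))  # 2
print(demo_df.shape)   # (3, 2)
```

The number of dictionary items matches the number of columns, and the length of the lists matches the number of rows.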

#### 4b) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 4c)

Use the appropriate attribute to display the DATA TYPES for each COLUMN within `pirates_df` to the screen.

Are all columns the same data type? Type your response in a Markdown cell.
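A DataFrame can hold a different data type in each column. The sketch below uses a hypothetical `demo_df`; the exact type names printed can vary with your Pandas version and platform.

```python
import pandas as pd

# Hypothetical DataFrame mixing text, integer, and floating point columns
demo_df = pd.DataFrame({"name": ["a", "b"],
                        "count": [1, 2],
                        "ratio": [0.5, 1.5]})

# One data type per column, returned as a Series indexed by column name
print(demo_df.dtypes)
```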


#### 4c) - SOLUTION

In [None]:
# type your response here

### 4d)

Use the `.info()` method to display the basic information associated with `pirates_df` to the screen.

Is the information consistent with what you displayed previously? Type your response in a Markdown cell.

#### 4d) - SOLUTION

In [None]:
# type your code here

### 4e)

Your `pirates_df` DataFrame contains two columns that record the month and day a game was played on. However, it does not record the year! Every game contained in `pirates_df` was played in 2022.

Add a column to `pirates_df` named `'Year'` that stores the year associated with all games in `pirates_df`.
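Assigning a single value to a new column name repeats that value on every row. A minimal sketch on a hypothetical `demo_df`:

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
demo_df = pd.DataFrame({"city": ["A", "B", "C"], "visits": [3, 1, 2]})

# A scalar on the right-hand side is broadcast to all rows
demo_df["country"] = "Hypothetica"

print(demo_df.shape)  # (3, 3)
print(demo_df)
```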

#### 4e) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 4f)

Use the appropriate attribute to display the number of rows and columns associated with `pirates_df` to the screen.

Is the number of columns different from what you displayed in 4b)?

#### 4f) - SOLUTION

In [None]:
# type your code here

## Problem 5

Let's now practice selecting rows, or as I prefer to call it, FILTERING the data!

You will continue to work with the `pirates_df` object in this problem.

### 5a)

Filter `pirates_df` to select all rows where the `Home_Score` is GREATER THAN 6.

Save the result as the variable `pirates_df_6a`.

Display all columns with the filtered rows to the screen.
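Filtering relies on a boolean mask. Here is a sketch on a hypothetical `demo_df` with made-up scores:

```python
import pandas as pd

# Hypothetical DataFrame
demo_df = pd.DataFrame({"team": ["A", "B", "C", "D"],
                        "score": [5, 8, 2, 9]})

# The comparison yields one True/False value per row
mask = demo_df["score"] > 6

# Rows where the mask is True are kept, with their original index labels
high = demo_df.loc[mask]

print(mask.tolist())  # [False, True, False, True]
print(high)
```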

#### 5a) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 5b)

Filter `pirates_df` to select all rows where `Away_Score` is LESS THAN 3.

Do NOT display all columns this time. Only display the `Home_Team`, `Away_Team`, and `Winning_Team` columns with the filtered rows.

Save the final result as `pirates_df_6b`.
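One way to filter rows and keep only some columns in a single step is `.loc` with a row mask and a list of column names. This is sketched on a hypothetical `demo_df`:

```python
import pandas as pd

# Hypothetical DataFrame
demo_df = pd.DataFrame({"team": ["A", "B", "C"],
                        "venue": ["north", "south", "east"],
                        "score": [5, 8, 2]})

# First argument selects rows, second argument selects columns by name
low = demo_df.loc[demo_df["score"] < 3, ["team", "venue"]]

print(low)
```

Filtering first and then selecting columns with double brackets gives the same result; the `.loc` form does both in one expression.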

#### 5b) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 5c)

Let's now filter based on MULTIPLE conditions.

Filter `pirates_df` to select all rows where the `Home_Score` is GREATER THAN 6 **and** the Pirates are the `Winning_Team`.

Save the result as `pirates_df_6c`.

Display all columns with the filtered rows to the screen.
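Multiple conditions are combined element-wise with `&` (and) or `|` (or), and each condition needs its own parentheses. A minimal sketch on a hypothetical `demo_df`:

```python
import pandas as pd

# Hypothetical DataFrame
demo_df = pd.DataFrame({"team": ["A", "B", "C", "B"],
                        "score": [5, 8, 9, 2]})

# Parentheses are required because & binds more tightly than > and ==
both = demo_df.loc[(demo_df["score"] > 6) & (demo_df["team"] == "B")]

print(both)
```

Python's `and` keyword does not work here because it cannot compare two Series element by element.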

#### 5c) - SOLUTION

In [None]:
# your code here
raise NotImplementedError

### 5d)

Filter `pirates_df` to select all rows where the `Away_Score` is LESS THAN 3 **and** the Pirates are the `Winning_Team`.

Save the result as the variable `pirates_df_6d`.

Display all columns with the filtered rows to the screen.

#### 5d) - SOLUTION

In [None]:
# your code here
raise NotImplementedError