# MIDTERM REVIEW

## Midterm

Exam Timing and Logistics: 
- Exam window is Wednesday, February 10 at 11:59pm PST through Thursday, February 11 at 11:59pm PST
- Once started, you'll have 1.5 hours to complete it
- The exam is taken on Gradescope
- Question types: multiple choice, true/false, numerical fill-in, and long answer
- Open-book, open-notes, open-internet (BUT NO STUDENT COLLABORATION)

Best way to study:
- do the project! (great for studying, plus it's due Saturday, February 13th)
- Old homeworks, labs, discussions, review lectures (the exam covers lectures 1-14)

*Check out the post on Campuswire for more details!*

Here are links to the course [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html) and the [reference sheet](https://ucsd-ets.github.io/dsc10-2020-fa/published/default/reference/babypandas-reference.pdf) we often use.

<img src="data/panda_relax.jpg" width="500">

## Some Topics to study for the exam 
#### (this list is not exhaustive)

- News articles and randomized controlled trials (similar to HW1)
- Understanding and working with the index of a table
- Strategies for extracting information from a table (knowing how to combine different table functions to get out the desired information)
- Interpreting the output of code, including table manipulations
- Knowing when to use different types of visualizations
- Density histograms (calculating height, area, count, and percent)
- Galton’s method for prediction
- Probability (similar to Lecture 12)
- Sampling schemes (deterministic, probabilistic, sample of convenience)
- Empirical distributions and probability distributions

In [None]:
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Law of Averages and Estimating Probabilities

### Rolling a die $N$ times

### Discussion Question

If you roll a die 4 times, what's P(at least one 6)?

|Option|Answer|
|---|---|
|A| $5/6$|
|B| $1-5/6$|
|C| $1-(5/6)^4$|
|D| $1-(1/6)^4$|
|E| None of the above|


### Answer for 4 rolls
* P(at least one 6) = 

### Answer for N rolls
* P(at least one 6) = 
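
One way to check an answer here is the complement rule: "at least one 6" is the opposite of "no 6 on any roll". The sketch below is a minimal check under that reasoning, using only the standard library. The function name `p_at_least_one_six` is made up for this example.

```python
from fractions import Fraction
from itertools import product

def p_at_least_one_six(n_rolls):
    """Exact P(at least one 6) in n_rolls fair rolls, via 1 - P(no 6 at all)."""
    return 1 - Fraction(5, 6) ** n_rolls

# Brute-force check for 4 rolls: list all 6**4 equally likely outcomes
all_outcomes = list(product(range(1, 7), repeat=4))
with_a_six = sum(1 for outcome in all_outcomes if 6 in outcome)
brute_force = Fraction(with_a_six, len(all_outcomes))

print(p_at_least_one_six(4), brute_force)
print(float(p_at_least_one_six(4)))
```

Both fractions agree, which is a good sign that the complement-rule formula carries over to any number of rolls $N$.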

### Plot the true distribution for each N

In [None]:
# chance of getting at least one six 
rolls = np.arange(1, 51)
at_least_one = bpd.DataFrame().assign(n_rolls=rolls, chance=1-(5/6)**rolls)
at_least_one.plot(kind='scatter', x='n_rolls', y='chance')

### Simulate the probability for N=20
* What is the chance of getting at least one 6 in 20 rolls?

In [None]:
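faces = np.arange(1, 7)
outcomes = np.random.choice(faces, 20)  # pick random number from faces, 20 times
outcomes

In [None]:
# number of positive outcomes
np.count_nonzero(outcomes == 6)

### Run this simulation 100,000 times

In [None]:
rolled6 = 0
trials = 100000
for i in np.arange(trials):
    outcomes = np.random.choice(faces, 20)
    # count the trial as a success if any of the 20 rolls is a 6
    if np.count_nonzero(outcomes == 6) >= 1:
        rolled6 = rolled6 + 1

# estimate the probability
rolled6/trials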

### Simulate the probability for N=20
* wrap the experiment in a function that takes the number of trials as the input
* run the experiment many times

In [None]:
def roll_20(trials):
    rolled6 = 0
    for i in np.arange(trials):
        outcomes = np.random.choice(faces, 20)
        if np.count_nonzero(outcomes == 6) >=1:
            rolled6 = rolled6 + 1

    return rolled6/trials

roll_20(1000)
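
For comparison, here is a loop-free sketch of the same experiment. It is not how we wrote it in lecture: `np.random.default_rng`, `integers`, and the seed value are choices made only for this example, so that the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded generator (an assumption for repeatability)
trials = 100_000

# One row per trial, 20 die rolls per row (upper bound 7 is excluded)
rolls = rng.integers(1, 7, size=(trials, 20))

# True for every trial that contains at least one 6
has_six = (rolls == 6).any(axis=1)

estimate = has_six.mean()           # proportion of successful trials
true_prob = 1 - (5 / 6) ** 20
print(estimate, true_prob)
```

With this many trials the estimate should land very close to the true probability, which is the law of averages at work.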

In [None]:
estimates = np.array([])
for i in np.arange(500):
    estimates = np.append(estimates, roll_20(1000))
    
probs = bpd.DataFrame().assign(estimates=estimates)

In [None]:
probs.plot(kind='hist', density=True, bins=np.linspace(.95, .99, 15))
true_prob = 1 - (5/6)**20

plt.axvline(x=true_prob, c='r');

## Approximately how many of the 500 estimates fall between .965 and .970?

In [None]:
probs.plot(kind = 'hist', density = True, bins = 5)
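
To answer questions like this from a density histogram, convert bar height to area and then to a count: area = height × width, and count = area × total number of values. The sketch below shows the arithmetic with `np.histogram` on ten made-up values. These are not the simulated estimates above, just numbers chosen to make the bins easy to read.

```python
import numpy as np

# Hypothetical values and bin edges, for illustration only
values = np.array([0.960, 0.962, 0.966, 0.967, 0.968,
                   0.971, 0.972, 0.974, 0.976, 0.978])
bins = np.array([0.960, 0.965, 0.970, 0.975, 0.980])

# density=True returns bar heights whose areas sum to 1
heights, edges = np.histogram(values, bins=bins, density=True)
widths = np.diff(edges)

areas = heights * widths            # proportion of values in each bin
counts = areas * len(values)        # number of values in each bin

print(heights)
print(areas)
print(counts)
```

Multiply an area by 100 to get a percent. The same steps work when you read a bar height off a plot by eye.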

## NBA Player Data

In [None]:
df = bpd.read_csv('data/player_data.csv')
df

---

# Top Ten Table Patterns

## And some variations

Let's look at the most common patterns we have been using on tables. They are simple when you can run the code on a computer.

For the exam, however, you need to know them well enough to use them without one.

Best way to study: write code with pen and paper. Learn to check your code for logical and syntax errors without the help of Python!

# 0) Get and Drop Columns

**Pattern**: `df.get(column_name)`

**Pattern**: `df.drop(columns = column_name)`

where `column_name` is a string (`drop` also accepts a list of strings)

### What is the output type of the following line of code?

In [None]:
df.get('Points')

In [None]:
df.get('Age')

### What will the variable df_modified contain after running the following line of code?

### Drop column "Age"

In [None]:
df_modified = df.drop("Age")

In [None]:
help(bpd.DataFrame.drop)

In [None]:
df_modified = df.drop(columns = "Age")
df_modified

In [None]:
df_modified = df.drop(columns = ["Age", "Team", "Games"])
df_modified
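
A quick sketch of the output types, on a tiny made-up table. It uses plain `pandas` rather than `babypandas` (an assumption so the example runs on its own). The two methods used here, `get` and `drop(columns=...)`, behave the same way in both.

```python
import pandas as pd

# Hypothetical players, for illustration only
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Team": ["X", "X", "Y"],
    "Age": [22, 31, 27],
    "Points": [900, 1500, 1200],
})

points = players.get("Points")                        # one column -> Series
without_age = players.drop(columns="Age")             # table minus one column -> DataFrame
without_two = players.drop(columns=["Age", "Team"])   # a list drops several columns

print(type(points).__name__, type(without_age).__name__)
print(list(without_age.columns))
print(list(players.columns))                          # the original table is unchanged
```

Note that `drop` returns a new table. To keep the result you have to assign it to a name.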

# 1) Get something by its label & index
**Pattern**: `df.get(column_name).loc[row_label]`

### Getting data by its label

In [None]:
df = df.set_index('Name')

### What does the following line of code return?

In [None]:
df.get('Points').loc['LeBron James']

### What does the following line of code return?

In [None]:
df.get('Games').loc['Chris Paul']

### Getting Multiple datapoints by their labels

Get the points of players James Harden, Stephen Curry, Adreian Payne

### What is the output type of the last line of code?

In [None]:
query_players = ["James Harden", "Stephen Curry", "Adreian Payne"]
df.get('Points').loc[query_players]
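
To see the difference between one label and a list of labels, here is a small sketch on a made-up table (plain `pandas` is assumed, and the names and numbers are invented).

```python
import pandas as pd

# Hypothetical players, indexed by name
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Points": [900, 1500, 1200],
}).set_index("Name")

one = players.get("Points").loc["Player B"]                    # single label -> single value
several = players.get("Points").loc[["Player A", "Player C"]]  # list of labels -> Series

print(one)
print(type(several).__name__)
print(several)
```

A list of labels gives back a Series that keeps those labels as its index, in the order you asked for them.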

# 2) Find the label with the largest/smallest value.

**Pattern**: `df.sort_values(by = "Points").iloc[-1]`

**Pattern**: `df.sort_values(by = "Points", ascending = False).iloc[0]`
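
Both patterns find the same row. The sketch below checks this on a made-up table with plain `pandas` (an assumption, and the data is invented). With the names in the index, `.index[-1]` gives the label and `.get(...).iloc[-1]` gives the value.

```python
import pandas as pd

# Hypothetical players, indexed by name
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Points": [900, 1500, 1200],
}).set_index("Name")

# Ascending sort: the largest value ends up in the last row
sorted_up = players.sort_values(by="Points")
top_name = sorted_up.index[-1]
top_points = sorted_up.get("Points").iloc[-1]

# Descending sort: the largest value ends up in the first row
sorted_down = players.sort_values(by="Points", ascending=False)

print(top_name, top_points)
print(sorted_down.index[0], sorted_down.get("Points").iloc[0])
```

For the smallest value, swap the positions: `iloc[0]` after an ascending sort, or `iloc[-1]` after a descending one.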

### Get the name and points of the highest-scoring player

In [None]:
df.get("Points").sort_values()

### What will occur when running the line of code below?
- A. df will be a dataframe containing the sorted values of the original dataframe df in **ascending order**
- B. df will be a dataframe containing the sorted values of the original dataframe df in **descending order**
- C. df will be a NoneType
- D. df will not get set because an error will occur

In [None]:
df = df.sort_values()

In [None]:
df = df.sort_values(by = "Points")

### What information does the following line of code give us?

In [None]:
df.get('Points').iloc[0]

### What information do the following lines of code give us?

In [None]:
df.get('Points').iloc[-1]

In [None]:
df.index[-1]

## Current state of the dataframe

In [None]:
df

### What information is output below?

In [None]:
df = df.sort_values(by = "Age", ascending = False)
df.get("Age").iloc[4]

In [None]:
df.index[4]

### What is the type of the following output, and what information does it contain?

In [None]:
query_range = np.arange(0, 6)

(df
    .get('Age')
    .sort_values(ascending = True)
    .iloc[query_range]
)

### What is the output below?

In [None]:
df = df.sort_values(by = ["Age", "Points"]) 
df.index[-1]

### Get the age and points of this player

In [None]:
df.get("Age").iloc[-1]

In [None]:
df.get("Points").iloc[-1]

---

# 3) Compute a statistic for a subset. Filter to get the subset.

**Example**: Players info for players with age >= 30

**Pattern**:

`bool_mask = df.get('Age') >= 30`

`df[bool_mask]`

### Return a table containing entries for players with age >= 30

In [None]:
bool_mask = df.get('Age') >= 30
df[bool_mask]

### What is the output of the last line of code below?

In [None]:
bool_mask = df.get('Age') >= 30
df[bool_mask].get('Points').mean()

### What is the output of the last line of code below?

In [None]:
mean_points = df.get('Points').mean()
bool_mask = df.get('Points') >= mean_points
df[bool_mask].get('Age').mean()
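
If it helps to see the intermediate steps, here is the filter-then-summarize pattern on a made-up table (plain `pandas` is assumed, and the numbers are invented).

```python
import pandas as pd

# Hypothetical players, for illustration only
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Age": [22, 31, 27, 35],
    "Points": [900, 1500, 1200, 700],
})

mask = players.get("Age") >= 30                  # Series of True/False, one per row
older = players[mask]                            # keeps only the rows where mask is True
mean_points = older.get("Points").mean()         # statistic computed on the subset

print(mask.tolist())
print(older.get("Name").tolist())
print(mean_points)
```

The mask itself is just a Series of Booleans. The filtering happens when it is passed inside `df[...]`.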

# 4) Combining Conditions, Filtering and Getting Statistics

**Pattern**:

`bool1 = df.get('col1') > num1`

`bool2 = df.get('col2') == num2`

`bool_condition = bool1 & bool2`

`df[bool_condition]`

**Pattern**: Don't forget the parentheses if you write it like below:

`df[(...) & (...) & (...)]`

`df[(df.get('col1') > num1) & (df.get('col2') == num2)]`

### Filter the table to players who have more than 600 assists and more than 100 steals

In [None]:
mask1 = df.get('Assists') > 600
mask2 = df.get('Steals') > 100
bool_mask = mask1 & mask2
df[bool_mask]

### What information do we obtain from the following lines of code?

In [None]:
mask1 = df.get('Rebounds') > 1000
mask2 = df.get('Blocks') > 100
bool_mask = mask1 & mask2
df[bool_mask].shape[0]

### What information do we obtain from the following lines of code?

In [None]:
mean_points = df.get('Points').mean()
mask1 = df.get('Points') >= mean_points
mask2 = df.get('Games') > 40
bool_mask = mask1 & mask2
df[bool_mask].get('Age').median()

# 5) Compute statistics for a group. 

**Pattern**:

`df.groupby(column_name).func()`

where `func` is the aggregating function (for example `count`, `mean`, or `min`)

In [None]:
df = df.reset_index()

In [None]:
df.groupby('Team').min()

### True or False: Al Horford is the youngest player on ATL
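
Something to keep in mind for questions like this: after `groupby`, `.min()` is taken for each column separately, so the values in one output row can come from different players. The sketch below shows this on a made-up table (plain `pandas` is assumed, and the names, teams, and ages are invented).

```python
import pandas as pd

# Hypothetical players, for illustration only
players = pd.DataFrame({
    "Name": ["Avery", "Blake", "Casey", "Drew"],
    "Team": ["X", "X", "Y", "Y"],
    "Age": [30, 22, 25, 28],
})

mins = players.groupby("Team").min()
print(mins)    # for team X: Name is "Avery" (first alphabetically), Age is 22

# To find who is actually youngest on team X, filter and sort instead
team_x = players[players.get("Team") == "X"]
youngest_on_x = team_x.sort_values(by="Age").get("Name").iloc[0]
print(youngest_on_x)
```

In the grouped table, the name next to the minimum age is the alphabetically first name on the team, not necessarily the youngest player.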

### What does the following line of code return and what is its output type?

In [None]:
df.groupby('Team').count().get(["Points"])

### What does the following line of code output?

In [None]:
df.groupby('Team').mean().get(["Games", "Points"])

### What values does the new column contain, and what is its name?

In [None]:
new_col = df.get("Points") / df.get("Games")
df_new = (df
    .assign(Points_Per_Game = new_col)
    .sort_values(by = "Points_Per_Game",ascending = False)
)
df_new

# 6) Apply function & Conditionals

**Pattern**: `df.get(a_column).apply(a_function)`

### Given a full name, write a function that finds how many words it has

In [None]:
def find_name_len(string):
    """ Finds how many words the name contains """
    return len(string.split())

In [None]:
find_name_len("Frank Lloyd Wright")

In [None]:
find_name_len("Tony Montana")

### What does the following line of code output?

In [None]:
df.get("Name").apply(find_name_len)

### What information does this code output?

In [None]:
( df
 .reset_index()
 .get("Name")
 .apply(find_name_len)
 .max()
)

In [None]:
def assign_age_group(age):
    if age < 21:
        return "young"
    elif age < 31:
        return "mid"
    else:
        return "old"

### What will the line of code below output?

In [None]:
assign_age_group(21)

In [None]:
assign_age_group(35)

### Add a new column to the table, which shows the age group of each player

In [None]:
new_col = df.get("Age").apply(assign_age_group)
df_new = df.assign(Age_Group = new_col)
df_new

# 7) Groupby Multiple Columns and look at statistics


**Pattern**:

`df.groupby([column_name1, column_name2]).func()`

where `func` is the aggregating function

* There should always be an aggregating function. Otherwise we just get a groupby object.
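
Here is a small sketch of what you get with and without the aggregating function, on a made-up table. It uses plain `pandas` (an assumption, and the data is invented), where the groupby object is called `DataFrameGroupBy`.

```python
import pandas as pd

# Hypothetical players, for illustration only
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Team": ["X", "X", "X", "Y", "Y"],
    "Age_Group": ["mid", "mid", "old", "mid", "old"],
})

grouped = players.groupby(["Team", "Age_Group"])
print(type(grouped).__name__)         # a groupby object, not a table

counts = grouped.count()              # aggregating gives a table
print(list(counts.index.names))       # the grouping columns are now in the index
print(list(counts.columns))

flat = counts.reset_index()           # move them back into regular columns
print(list(flat.columns))
```

After `reset_index()`, `Team` and `Age_Group` are ordinary columns again, so `.get` works on them.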

### What does the following line of code output?

In [None]:
df_new.groupby(["Team", "Age_Group"])

### What does the following line of code output?

In [None]:
df_groups = (df_new
 .groupby(["Team", "Age_Group"]).count()
 .get(["Team", "Age_Group", "Games"])
)

In [None]:
df_groups = (df_new
 .groupby(["Team", "Age_Group"])
 .count()
 .reset_index() # critical change here: Team and Age_Group were in the index, not columns!
 .get(["Team", "Age_Group", "Games"])
)
df_groups

# 8) Rename a column

**Pattern**: Store the column to be renamed, assign it under the new column name, then drop the column with the old name.

`new_col = df.get(old_column_name)`

`df = (df.assign(new_col_name = new_col).drop(columns = old_column_name))`

In [None]:
df_groups

### What are the columns in the output?

In [None]:
new_col = df_groups.get("Games")
df_groups = (df_groups
             .assign(Player_Count = new_col)
             .drop(columns = "Games")
            )
df_groups

# 9) Get all rows containing a string.

**Pattern**

`bool_mask = df.get(column_of_strings).str.contains('James')`

`df[bool_mask]`

### What does the output dataframe contain?

In [None]:
bool_mask = df.get("Name").str.contains("James")
df[bool_mask]

### Only players with the substring "Reg" and substring "ie" in their full name remain.

In [None]:
mask1 = df.get("Name").str.contains("ie")
mask2 = df.get("Name").str.contains("Reg")
df[mask1 & mask2]

# Top 8 Possible Pitfalls & Things to Keep in Mind

## 0) Difference between & and "and"

Use `and` between single Boolean values (for example in an `if` condition). Use `&` between Boolean arrays or Series.

In [None]:
True and False

In [None]:
np.array([True, False, True]) & np.array([False, False, False])

In [None]:
np.array([True, False, True]) and np.array([False, False, False]) # don't do this!
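
The sketch below shows why the last cell is a problem: `and` asks Python for a single True/False answer about a whole array, which NumPy refuses to give. The array values are made up for this example.

```python
import numpy as np

a = np.array([True, False, True])
b = np.array([True, True, False])

# & compares the arrays element by element
both = a & b
print(both)

# and needs one truth value for the whole array, so NumPy raises an error
try:
    a and b
    error_name = None
except ValueError as err:
    error_name = type(err).__name__
    print(error_name, err)
```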

## 1) Parentheses when combining conditionals:

In [None]:
df[df.get("Age") >= 25 & df.get("Points") >= 2000]

In [None]:
df[(df.get("Age") >= 25) & (df.get("Points") >= 1800)]
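
Why do the parentheses matter? `&` binds more tightly than `>=`, so without them Python evaluates `25 & points` first and is left with a chained comparison it cannot resolve. The sketch below demonstrates this with two made-up Series in plain `pandas` (an assumption, and the values are invented).

```python
import pandas as pd

# Hypothetical columns, for illustration only
age = pd.Series([22, 31, 27])
points = pd.Series([900, 2500, 2100])

# With parentheses: two Boolean Series combined element by element
good = (age >= 25) & (points >= 2000)
print(good.tolist())

# Without parentheses this is read as: age >= (25 & points) >= 2000
try:
    age >= 25 & points >= 2000
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```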

## 2) Column names are meaningless after a `groupby` and count!

In [None]:
df.groupby("Team").count()

In [None]:
# Has no relation to the actual "Steals" and "Blocks" columns
df.groupby("Team").count().get(["Steals", "Blocks"])
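
To see why the column names stop meaning anything, here is a sketch on a made-up table (plain `pandas` is assumed, and the data is invented). `count` reports how many entries each group has in each column, so with no missing values every column shows the same group sizes.

```python
import pandas as pd

# Hypothetical players, for illustration only
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["X", "X", "Y", "Y"],
    "Steals": [40, 95, 10, 60],
    "Blocks": [5, 12, 80, 33],
})

counts = players.groupby("Team").count()
print(counts)

# Every column holds the number of rows per team, not steals or blocks
print(counts.get("Steals").tolist(), counts.get("Blocks").tolist())
```

Any one of the columns can be used to read off the group size. Renaming it (Pattern 8) makes that clearer.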

## 3) Reset index, especially after grouping with multiple columns.

In [None]:
(df_new
 .groupby(["Team", "Age_Group"])
 .count()
 .reset_index()
)

## 4) `iloc[]` vs `loc[]` vs array indexing `[]`

In [None]:
# Before using loc, make sure of what type of index you have:
df.index

In [None]:
df = df.set_index("Name")

In [None]:
df.get("Age").loc["Stephen Curry"]

In [None]:
df.get("Age").iloc[2]

In [None]:
df.index[2]
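
The three kinds of indexing side by side, on a made-up table with plain `pandas` (an assumption, and the data is invented):

```python
import pandas as pd

# Hypothetical players, indexed by name
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Age": [22, 31, 27],
}).set_index("Name")

ages = players.get("Age")

by_label = ages.loc["Player C"]      # loc: look up by row label
by_position = ages.iloc[2]           # iloc: look up by integer position
label_at_2 = players.index[2]        # index[...]: the label stored at that position

print(by_label, by_position, label_at_2)
```

Here `loc["Player C"]` and `iloc[2]` reach the same value because "Player C" happens to sit at position 2. After sorting, positions change but labels do not.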

## 5) Not specifying column while sorting table

Wrong: `df = df.sort_values(ascending = False)` 

Correct: `df = df.sort_values(by = column_name, ascending = False)` 
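
Part of the confusion is that a Series has only one column of values, so its `sort_values` needs no `by`, while a table does. The sketch below shows both cases on a made-up table in plain `pandas` (an assumption, and the data is invented), where leaving out `by` on a table raises a `TypeError`.

```python
import pandas as pd

# Hypothetical players, for illustration only
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Points": [900, 1500, 1200],
})

# Series: nothing to choose between, so no `by`
sorted_points = players.get("Points").sort_values()
print(sorted_points.tolist())

# Table: Python cannot guess which column to sort by
try:
    players.sort_values(ascending=False)
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Correct version
sorted_table = players.sort_values(by="Points", ascending=False)
print(sorted_table.get("Name").tolist())
```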

## 6) Trying to get the index using .get() instead of .index

In [None]:
# df.get("Name")
df.index

## 7) Using `df.drop` without the `columns` argument

Wrong: `df.drop(column_name)`

Correct: `df.drop(columns = column_name)`

In [None]:
# df.drop("Points")
df.drop(columns = "Points")