# SI 618: Data Manipulation and Analysis
## 02 - Introduction to pandas
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
    
Version 2023.01.09.1.CT


## Objectives:
* Know how to manipulate Series and DataFrame
* Draw a random sample of data
* Select a subset of data using boolean masking
* Compute descriptive and summary statistics
* Sort a DataFrame by index or column
* Group data and calculate aggregate statistics
* Make basic plots (scatter plot, histogram, bar chart, etc.)

## Submission Instructions:
Please turn in this Jupyter notebook file (both .ipynb and .html formats) on Canvas before the end of class.

## Points
All questions are worth a maximum of two points.

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [None]:
MY_UNIQNAME = '?'

## Python lists

Let's set up a couple of plain old Python lists:

In [None]:
names = ['Alphonso','Beata','Cal','Din','Ella']
scores = [3,5,4,4,5]

### <font color="magenta">Q1: Write code to iterate through the two lists to produce the following output:</font>
```
Alphonso has a score of 3.
Beata has a score of 5.
Cal has a score of 4.
Din has a score of 4.
Ella has a score of 5.
```
Do not import any additional packages (yet).

In [None]:
# insert your code here

## NumPy

In [None]:
import numpy as np

In [None]:
ar_names = np.array(names)
ar_names

### <font color="magenta">Q2: Create ```ar_scores``` that contains an array of the scores from above:</font>

In [None]:
ar_scores = np.nan

In [None]:
ar_scores

Now, let's say we wanted to modify the scores by multiplying each one by 1.25.

### <font color="magenta">Q3: Write some code that would do that using plain old Python (i.e. without using pandas, NumPy, etc.)</font>

In [None]:
# insert your code here

## ufuncs

We can use ufuncs to multiply each score by 1.25:

In [None]:
modified_scores = ar_scores * 1.25
modified_scores

In [None]:
modified_scores
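As a small illustration of what "ufunc" means here: the `*` operator on a NumPy array is applied element by element, and it gives the same result as calling the ufunc `np.multiply` directly. The array `demo` below is made up for this sketch and is not the class data.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
demo = np.array([2, 4, 6])

# The operator form and the explicit ufunc call are equivalent.
via_operator = demo * 0.5
via_ufunc = np.multiply(demo, 0.5)

print(via_operator)
print(np.array_equal(via_operator, via_ufunc))
```

Either form avoids writing an explicit Python loop over the elements.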

### <font color="magenta">Q4: Write code to create a new array called sqrt_scores that contains the square roots of each of the original scores</font>

In [None]:
# insert your code here

## pd.Series

In [None]:
import pandas as pd

In [None]:
s_names = pd.Series(names)

In [None]:
s_names

In [None]:
s_scores = pd.Series(scores)
s_scores

In [None]:
names # just to remind ourselves what names looks like

In [None]:
s_scores = pd.Series(scores,index=names)
s_scores
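Once a Series has a custom index, values can be looked up by label as well as by position. The sketch below uses a hypothetical Series `s_demo` with made-up labels to show both styles.

```python
import pandas as pd

# Hypothetical data, not the class scores.
s_demo = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

by_label = s_demo['y']        # look up by index label
by_position = s_demo.iloc[1]  # look up by integer position
subset = s_demo[['x', 'z']]   # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset)
```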

## pd.DataFrame

In [None]:
df = pd.DataFrame({"name": names, "score": scores})

In [None]:
df

In [None]:
specializations = ['DS', 'UX', 'UX', 'DS', 'DS']

In [None]:
df['specialization'] = specializations
df

Let's say we wanted to set the "name" column to be the index:

In [None]:
df.set_index("name")

In [None]:
df_indexed_by_name = df.set_index("name")

In [None]:
df_indexed_by_name

In [None]:
df.set_index("name",inplace = True) # equivalent to df = df.set_index("name")

In [None]:
df

In [None]:
df.describe().T

In [None]:
df.reset_index(inplace = True)
df
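To make the difference between the two styles of `set_index` concrete, here is a minimal sketch on a hypothetical DataFrame `df_demo`. Without `inplace=True`, `set_index` returns a new DataFrame and leaves the original untouched, and `reset_index` moves the index back into a regular column.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_demo = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})

indexed = df_demo.set_index("key")  # returns a new DataFrame
restored = indexed.reset_index()    # "key" becomes a column again

print(list(df_demo.columns))  # original still has both columns
print(list(indexed.columns))  # "key" is now the index, not a column
print(restored.equals(df_demo))
```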

# Part 1 (as a group): Mental Health Disorders In the Tech Workplace
From https://www.kaggle.com/osmi/mental-health-in-tech-survey

## Data Description

This dataset is from a 2014 survey that measures attitudes towards mental health and frequency of mental health disorders in the tech workplace.

## Metadata
|Field|Description|
|:----|:----|
|**Timestamp**| |
|**Age**| |
|**Gender**| |
|**Country**| |
|**state**| If you live in the United States, which state or territory do you live in?
|**self_employed**| Are you self-employed?
|**family_history**| Do you have a family history of mental illness?
|**treatment**| Have you sought treatment for a mental health condition?
|**work_interfere**| If you have a mental health condition, do you feel that it interferes with your work?
|**no_employees**| How many employees does your company or organization have?
|**remote_work**| Do you work remotely (outside of an office) at least 50% of the time?
|**tech_company**| Is your employer primarily a tech company/organization?
|**benefits**| Does your employer provide mental health benefits?
|**care_options**| Do you know the options for mental health care your employer provides?
|**wellness_program**| Has your employer ever discussed mental health as part of an employee wellness program?
|**seek_help**| Does your employer provide resources to learn more about mental health issues and how to seek help?
|**anonymity**| Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
|**leave**| How easy is it for you to take medical leave for a mental health condition?
|**mental_health_consequence**| Do you think that discussing a mental health issue with your employer would have negative consequences?
|**phys_health_consequence**| Do you think that discussing a physical health issue with your employer would have negative consequences?
|**coworkers**| Would you be willing to discuss a mental health issue with your coworkers?
|**supervisor**| Would you be willing to discuss a mental health issue with your direct supervisor(s)?
|**mental_health_interview**| Would you bring up a mental health issue with a potential employer in an interview?
|**phys_health_interview**| Would you bring up a physical health issue with a potential employer in an interview?
|**mental_vs_physical**| Do you feel that your employer takes mental health as seriously as physical health?
|**obs_consequence**| Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
|**comments**| Any additional notes or comments



Let's load the usual libraries and also ask for plots to be rendered inside the notebook:

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline



Then read the CSV file into a DataFrame:

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/umsi-data-science/data/main/survey.csv")

It's common to look at the first few rows of the resulting DataFrame using .head():

In [None]:
df.head()

If you want to look at a random sample of rows instead, you can use .sample():

In [None]:
df.sample(5)

Finally, you can get some basic information about the size and shape of the DataFrame:

In [None]:
print("The number of rows of the dataset is: ", len(df))
print("The number of columns of the dataset is: ", len(df.columns))
print("The shape of the dataset is: ", df.shape)

You can list the columns:


In [None]:
df.columns

And you can extract one or more columns. Single brackets with one column name
return a Series; double brackets with a list of column names return a DataFrame:

In [None]:
print(df['Country'])

In [None]:
country_state = df[['Country', 'state']]
country_state.head()
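The return type depends on how the column is requested. The following sketch, on a hypothetical three-column DataFrame `df_demo`, compares single brackets, attribute access, and double brackets.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_demo = pd.DataFrame({
    "Country": ["A", "B"],
    "state": ["x", "y"],
    "Age": [30, 40],
})

one_col = df_demo["Country"]            # Series
also_one = df_demo.Country              # same Series via attribute access
one_col_frame = df_demo[["Country"]]    # one-column DataFrame
two_cols = df_demo[["Country", "state"]]  # two-column DataFrame

print(type(one_col).__name__, type(one_col_frame).__name__)
print(two_cols.shape)
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.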

## Extracting rows

In [None]:
df.iloc[0]

In [None]:
df.loc[0]

In [None]:
df.head(1)

In [None]:
df_gender = df.set_index('Gender')

In [None]:
df_gender.head()

In [None]:
df_gender.loc[219] #  will throw an exception


In [None]:
df.iloc['Gender'] # will throw an exception

In [None]:
import traceback
try:
    df.iloc['Gender'] # generates error
except TypeError as e:
    print(traceback.format_exc())

In [None]:
df.iloc[0]
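The cells above rely on the distinction that `.iloc` selects by integer position while `.loc` selects by index label. A minimal sketch with a hypothetical string-indexed DataFrame `df_demo` shows why the two agree on a default integer index but diverge once the index changes.

```python
import pandas as pd

# Hypothetical data with string labels as the index.
df_demo = pd.DataFrame({"Age": [25, 31, 47]}, index=["p", "q", "r"])

first_by_position = df_demo.iloc[0]  # first row, whatever its label
first_by_label = df_demo.loc["p"]    # row whose label is "p"

# There is no row labelled 0, so .loc[0] raises KeyError.
try:
    df_demo.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

print(first_by_position.equals(first_by_label))
print(missing_label)
```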

## Sorting
You can use either sort_values() or sort_index():


In [None]:
df_sorted = df.sort_values('Age')
df_sorted.tail(10)
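For comparison, here is a small sketch of both sorting methods on a hypothetical DataFrame `df_demo`. Both return a new, sorted DataFrame and leave the original row order alone.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_demo = pd.DataFrame({"Age": [40, 22, 35]}, index=["b", "c", "a"])

by_age = df_demo.sort_values("Age")                        # ascending by column
by_age_desc = df_demo.sort_values("Age", ascending=False)  # descending by column
by_index = df_demo.sort_index()                            # ascending by index label

print(by_age)
print(by_index)
```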

## Filtering using Boolean Masking

In [None]:
df.Age

In [None]:
df['Age'] > 40

In [None]:
df[df['Age'] > 0]

In [None]:
df['Age'] > 40

In [None]:
df[df['Age'] > 40]

### Example: Find people who reported a family history of mental health conditions.

Solution:

In [None]:
df[df.family_history == 'Yes']

You can use a simple expression like ```df[df['family_history'] == 'Yes']``` or you can make more complex boolean expressions using parentheses: 


In [None]:
df_filtered = df[(df['family_history'] != 'No') & (df['treatment'] == 'Yes')]
df_filtered.head()
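The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `==` and `>`. One way to keep complex filters readable is to name each mask first, as in this sketch on a hypothetical DataFrame `df_demo` (the column names are made up).

```python
import pandas as pd

# Hypothetical data for illustration only.
df_demo = pd.DataFrame({
    "Age": [19, 34, 52, 25],
    "remote": ["Yes", "No", "Yes", "No"],
})

over_30 = df_demo["Age"] > 30
is_remote = df_demo["remote"] == "Yes"

both = df_demo[over_30 & is_remote]        # element-wise AND
either = df_demo[over_30 | is_remote]      # element-wise OR
neither = df_demo[~(over_30 | is_remote)]  # element-wise NOT

print(len(both), len(either), len(neither))
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the `&`, `|`, and `~` operators are used instead.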

In [None]:
df.coworkers.value_counts()

### <font color="magenta">Q5: How many people are willing to discuss a mental health issue with their supervisor or their coworkers? </font>

In [None]:
# insert your code here

### <font color="magenta">Q6: Make a new DataFrame ```df_millenials``` with only millennials (born between 1976 and 1996). Make appropriate assumptions when constructing your filter. </font>

In [None]:
# insert your code here

**NOTE: We will still use df for the following analysis**

## Descriptive and Summary Statistics

Example: What is the mean age of the survey sample?

Solution:

In [None]:
df['Age'].mean()

### Does that look right?  What should we do?

In [None]:
df.sort_values('Age').tail(10)
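A single implausible entry can dominate a mean. The sketch below uses a made-up Series `ages` to show the effect; filtering to a plausible range is one possible response, and the bounds of 15 and 100 are an assumption borrowed from the histogram filter used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical ages; the last value stands in for a data-entry error.
ages = pd.Series([25, 30, 35, 40, 100000])

raw_mean = ages.mean()

# Keep only values in an assumed plausible range before averaging.
plausible = ages[(ages > 15) & (ages < 100)]
filtered_mean = plausible.mean()

print(raw_mean, filtered_mean)
```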

### <font color="magenta">Q7: What is the _median_ age of the survey sample?  </font>

In [None]:
# insert your code here

### <font color="magenta">Q8: Write one line of code to compute basic statistics (mean, standard deviation, min, 25th percentile, etc.) about Age  </font>

Hint: see the readings

In [None]:
# insert your code here

## Unique Values, Counts, Membership

Example: Write one line of code to check the unique values of Gender

Solution:

In [None]:
df.coworkers.unique()

In [None]:
df.Gender.unique()

In [None]:
df.Gender.value_counts()

Example: Write one line of code to count the occurrences of the countries and show the top 5 countries.

Solution:

In [None]:
df.Country.value_counts().head(7)

Are you sure that's correct?

### <font color="magenta">Q9: Find the unique categories of no_employees. What is the frequency of each category? </font>

In [None]:
# insert your code here

### <font color="magenta">Q10: Among the people from the United States, how many respondents were there from each state?  </font>

In [None]:
# insert your code here

## Basic Plots

Example: Investigate the proportion (%) of people receiving mental health benefits from their employers.

Solution:

In [None]:
df.benefits

In [None]:
df.benefits.value_counts(normalize=True)

In [None]:
df.benefits.value_counts(normalize=True).plot.bar()

Example: Create a histogram of the distribution of Age values:

In [None]:
df[(df.Age < 100) & (df.Age > 15)].Age.plot.hist()

### <font color="magenta">Q11: Experiment with the number of bins in the histogram of the Age distribution.  Is there a "best" value?</font>

Hint: use the bins= option to plot()

In [None]:
# insert your code here

## Aggregation

Example: Find the number of participants from each state.

Solution:

In [None]:
df.state.value_counts()

In [None]:
df.groupby('state').size()
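The two cells above count the same thing but order the result differently. In this sketch on a hypothetical `df_demo`, `value_counts()` sorts by count (largest first) while `groupby(...).size()` sorts by the group key. With default settings, both skip missing values.

```python
import pandas as pd

# Hypothetical data, including one missing state.
df_demo = pd.DataFrame(
    {"state": ["MI", "CA", "MI", None, "CA", "MI", "TX"]}
)

counts = df_demo["state"].value_counts()  # ordered by count, descending
sizes = df_demo.groupby("state").size()   # ordered by state label

print(counts)
print(sizes)
```

The `groupby` form generalizes more easily, since `size()` can be swapped for other aggregations over any column.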

### <font color="magenta">Q12: Find the median age of people for each state. </font>

In [None]:
# insert your code here

# Part 2 (on your own): Exploration of Movie Titles and Movie Cast

## Time to load some data:

In [None]:
titles = pd.read_csv('https://github.com/umsi-data-science/data/raw/main/titles.csv', index_col=None)

In [None]:
titles.head()

The ```titles``` DataFrame contains a list of movie titles and their release years.

In [None]:
cast = pd.read_csv('https://github.com/umsi-data-science/data/raw/main/cast.zip', index_col=None)

The ```cast``` DataFrame contains the following columns:

**title** = name of movie

**year** = year of movie

**name** = name of actor/actress

**type** = actor or actress

**character** = character name

**n** = number in the credits (NaN when not available)

In [None]:
titles.head()

In [None]:
cast.sample(5)

### <font color="magenta">Q13: How many entries are there in the titles table?</font>

In [None]:
# insert your code here

### <font color="magenta">Q14: What are the two earliest movies?</font>

In [None]:
# insert your code here

### <font color="magenta">Q15: How many movies have the title "Hamlet"?</font>

In [None]:
# insert your code here

### <font color="magenta">Q16: List all of the "Treasure Island" movies from earliest to most recent.</font>

In [None]:
# insert your code here

### <font color="magenta">Q17: What are the ten most common movie names of all time?</font>

In [None]:
# insert your code here

### Stretch goals
The following questions are extra material and need not be completed as part of this
notebook.  We will, however, start next class by considering this material, so it's 
worth attempting if you have time.

### EXTRA (no points): <font color="magenta">Who are the 10 people most often credited as "Herself" in film history?</font>

In [None]:
# insert code here

### EXTRA (no points): <font color="magenta">What are the 10 most frequent roles that start with the word "Science"?</font>
Hint: read docs on str.startswith()

In [None]:
# insert code here

### EXTRA (no points): <font color="magenta">Comment on the differences in gender ratios for leading vs. supporting roles in the 1950s.  Does there appear to be a bias?</font>

Insert your response here.