# DSC 80: Lab 03

### Due Date: Saturday October 24th, Midnight - 11:59 PM

## Zoom Lab Hours
- Follow instructions on this link: https://docs.google.com/document/d/16qZpPSYhxwQDMcn-lGQjC-J-PzppLevv_mANLt2ko8g/edit 

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab**.py` file that will be imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab**.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab**.py` is compiled to bytecode (in the directory `__pycache__`) and the loaded module is cached by the running kernel. Subsequent imports of `lab**` merely reuse the already-loaded module.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import lab03 as lab

In [3]:
import os
import pandas as pd
import numpy as np

---

# Hypothetically speaking...

In this section we'll develop an intuition for the terms and structure of hypothesis testing -- it's nothing to be afraid of!

The first step is always to define what you're looking at, create your hypotheses, and set a level of significance.  Once you've done that, you can find a p-value which is related to your test statistic.

If all of these words are scary: look at the lecture notebook, the textbook references, and don't forget to think about the real-world meaning of these terms!  The following example describes a real-world scenario, so you can think about it through an everyday lens.

**Question 1: Faulty tires**

A tire manufacturer tests whether a set of tires meets the company's performance standards by checking:

> Whether a Honda CRV can come to a complete stop from 60 mph in fewer than 108 feet in 60 out of 100 tests.

That is, 60% of the time, the stopping distance of a car outfitted with generic tires should be less than 108 feet. The factory is wondering if a current run of tires is up to standard, so they choose a random set of tires from the production line to test their performance, and bring the car to a complete stop from 60 mph a total of 100 times. Then they ask:

> Are these tires faulty?


Which of the following are valid null hypotheses that address the question we are trying to answer, using the data we are given?  Which are valid alternative hypotheses?

Outfitted with that set of tires, the car:
1. has a 60 mph stopping distance under 108 feet, at least 60% of the time.
1. has a 60 mph stopping distance under 108 feet, at most 60% of the time.
1. has a 60 mph stopping distance under 108 feet, equal to 60% of the time.
1. has at least as short a stopping distance as the same car with generic tires, at least 60% of the time.
1. has at least as short a stopping distance as the same car with generic tires, at most 60% of the time.
1. has at least as short a stopping distance as the same car with generic tires, roughly 60% of the time.
1. is as safe as the car with generic tires.
1. causes the car to stop in a shorter distance.


Write a function `car_null_hypoth` which takes zero arguments and returns a list of the valid null hypotheses.  
Write a function `car_alt_hypoth` which takes zero arguments and returns a list of the valid alternative hypotheses.

Which of the following are valid test statistics for our question?

1. The average number of feet the car took to come to a complete stop in 100 attempts.
1. The number of times the car stopped in under 108 feet in 100 attempts.
1. The number of attempts it took before the car stopped in under 108 feet.
1. The proportion of attempts the car successfully stopped in under 108 feet.

Write a function `car_test_stat` which takes zero arguments and returns a list of valid test statistics.

The p-value is calculated as how likely it is to find something as extreme or more extreme than our observed test statistic.  To do this, we assume the null hypothesis is true, and then define "extremeness" based on the alternative hypothesis.

Why don't we just look at the probability of finding our observed test statistic?

1. Because our observed test statistic isn't extreme.
2. Because the probability of finding our observed test statistic equals the probability of finding something more extreme.
3. Because the probability of finding our observed test statistic is essentially zero.
4. Because our null hypothesis isn't suggesting equality.
5. Because our alternative hypothesis isn't suggesting equality.

Write a function `car_p_value` which takes zero arguments and returns the correct reason.
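
The definition of the p-value above can be made concrete with a small simulation. The sketch below is only an illustration: the observed count of 52 and the number of repetitions are made-up values (not results from the tire data), and it assumes a null success rate of 60% with "more extreme" meaning "fewer successful stops".

```python
import numpy as np

rng = np.random.default_rng(42)

n_tests = 100       # stops per experiment, as in the scenario
null_rate = 0.60    # assumed success rate under the null hypothesis
observed = 52       # hypothetical observed count, made up for illustration

# Simulate the test statistic (number of successful stops) under the null.
simulated = rng.binomial(n_tests, null_rate, size=100_000)

# p-value: chance of a result at least as extreme as the observed one,
# where "extreme" is assumed here to mean "as few or fewer successes".
p_value = np.mean(simulated <= observed)

# Chance of seeing exactly the observed value, for comparison.
p_exact = np.mean(simulated == observed)

print(p_value, p_exact)
```

Comparing the two printed numbers shows how the tail probability differs from the probability of the single observed value.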

# Grouping: Google Play Store

The questions below analyze a dataset of Google Play Store apps. The dataset has been preprocessed slightly for your convenience.

Columns:
* `App`: App Name
* `Category`: App Category
* `Rating`: Average App Rating
* `Reviews`: Number of Reviews
* `Size`: Size of App
* `Installs`: Binned Number of Installs
* `Type`: Paid or Free
* `Price`: Price of App
* `Content Rating`: Age group the app is targeted at
* `Last Updated`: Last Updated Date


Link: https://www.kaggle.com/lava18/google-play-store-apps

**Question 2**

First, we'd like to do some basic cleaning to this dataset to better analyze it.
In the function `clean_apps`, which takes the Play Store dataset as input, clean as follows and return the cleaned df:
* Keep `Reviews` as type int.
* Strip all letters from the ends of `Size`, convert all sizes to kilobytes, and convert the column to type float (Hint: all Sizes end in either M (megabyte) or k (kilobyte); a helper function may be useful here).
* Strip the '+' from the ends of `Installs`, remove the commas, and convert it to type int.
* Since `Type` is binary, change all the 'Free's to 1 and the 'Paid's to 0.
* Strip the dollar sign from `Price` and convert it to the appropriate numeric data type.
* Strip all but the year (e.g. 2018) from `Last Updated` and convert it to type int.

Please return a *copy* of the original dataframe; don't alter the original.
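
Several of the cleaning steps above are string manipulations, and a toy example may help show their shape. The sketch below is not the graded `clean_apps`: the helper name `size_to_kb` and the toy rows are hypothetical, it covers only four of the columns, and it assumes 1 megabyte equals 1000 kilobytes.

```python
import pandas as pd

def size_to_kb(size):
    """Convert a size string such as '19M' or '201k' to kilobytes (float)."""
    number, unit = float(size[:-1]), size[-1]
    # Assumption: 1 megabyte = 1000 kilobytes.
    return number * 1000 if unit == 'M' else number

# Toy rows in the same format as the raw dataset (values are illustrative).
toy = pd.DataFrame({
    'Size': ['19M', '8.7M', '201k'],
    'Installs': ['10,000+', '5,000,000+', '500+'],
    'Price': ['0', '$4.99', '0'],
    'Last Updated': ['January 7, 2018', 'August 1, 2018', 'June 20, 2016'],
})

cleaned_toy = toy.copy()  # work on a copy so the input is left untouched
cleaned_toy['Size'] = toy['Size'].apply(size_to_kb)
cleaned_toy['Installs'] = (
    toy['Installs'].str.rstrip('+').str.replace(',', '', regex=False).astype(int)
)
cleaned_toy['Price'] = toy['Price'].str.lstrip('$').astype(float)
# The year is the last four characters of the date string.
cleaned_toy['Last Updated'] = toy['Last Updated'].str[-4:].astype(int)
```

Building the result from `toy.copy()` keeps the original frame unchanged, which is what the instruction to return a copy asks for.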

In [50]:
play_fp = os.path.join('data', 'googleplaystore.csv')
play = pd.read_csv(play_fp)

In [25]:
play.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,"June 20, 2018"


In [81]:
fp = os.path.join('data', 'googleplaystore.csv')
df = pd.read_csv(fp)
cleaned = lab.clean_apps(df)
len(cleaned) == len(df)
#    True
cleaned.Reviews.dtype == int
 #   True

True

In [83]:
cleaned

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000.0,10000,1,0.0,Everyone,2018
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000.0,500000,1,0.0,Everyone,2018
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700.0,5000000,1,0.0,Everyone,2018
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000.0,50000000,1,0.0,Teen,2018
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800.0,100000,1,0.0,Everyone,2018
...,...,...,...,...,...,...,...,...,...,...
9140,FR Forms,BUSINESS,,0,9600.0,10,1,0.0,Everyone,2016
9141,Sya9a Maroc - FR,FAMILY,4.5,38,53000.0,5000,1,0.0,Everyone,2017
9142,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3600.0,100,1,0.0,Everyone,2018
9143,Parkinson Exercices FR,MEDICAL,,3,9500.0,1000,1,0.0,Everyone,2017


In [68]:
play.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000.0,10000,1,0.0,Everyone,2018
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000.0,500000,1,0.0,Everyone,2018
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700.0,5000000,1,0.0,Everyone,2018
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000.0,50000000,1,0.0,Teen,2018
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800.0,100000,1,0.0,Everyone,2018


In [69]:
play.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs            int64
Type                int64
Price             float64
Content Rating     object
Last Updated        int64
dtype: object

**Question 2 (Continued)**

Now, we can do some basic exploration.

In the function `store_info`, find the following using the **cleaned** dataframe:
* Find the year with the highest median `Installs`, among all years with at least 100 apps.
* Find the `Content Rating` with the highest minimum `Rating`.
* Find the `Category` with the highest average price.
* Find the `Category` with lowest average rating, among apps that have at least 1000 reviews.

and return these values in a list.

*Remark:* Note that the last question is asking you to compute the *average of averages* (the 'Rating' column contains the average rating of an app) -- such analyses are prone to occurrences of Simpson's Paradox. Considering apps with at least 1000 reviews helps limit the effect of such [ecological fallacies](https://afraenkel.github.io/practical-data-science/05/understanding-aggregations.html#reversing-aggregations-ecological-fallacies).
* You can assume there are no ties.
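
Questions of the form "find the group with the highest aggregate, among groups of a minimum size" follow a filter-then-aggregate pattern. The sketch below shows one way to write it on made-up data; the toy values and the minimum group size of 2 (the lab uses 100) are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Last Updated': [2016, 2017, 2017, 2018, 2018, 2018],
    'Installs': [1_000_000, 100, 500, 1000, 5000, 10000],
})

# Keep only the years with at least 2 apps.
enough = toy.groupby('Last Updated').filter(lambda g: len(g) >= 2)

# Median installs per remaining year, then the year with the largest median.
medians = enough.groupby('Last Updated')['Installs'].median()
best_year = medians.idxmax()
```

Here 2016 has the largest single value but is dropped by the size filter, so it cannot win.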


In [84]:
cleaned.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000.0,10000,1,0.0,Everyone,2018
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000.0,500000,1,0.0,Everyone,2018
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700.0,5000000,1,0.0,Everyone,2018
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000.0,50000000,1,0.0,Teen,2018
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800.0,100000,1,0.0,Everyone,2018


In [109]:
# Find the year with the highest median Installs, among all years with at least 100 apps.

# Find years with at least 100 apps
counted = cleaned.groupby('Last Updated').count()
more_100 = counted[counted['App'] >= 100]
more_yr = more_100.index.tolist()

more_tb = cleaned[cleaned['Last Updated'].isin(more_yr)]
# Find the year with the highest median Installs (select the column first so only numeric data is aggregated)
high_median_yr = more_tb.groupby('Last Updated')['Installs'].median().sort_values(ascending=False).index[0]
high_median_yr

2018

In [120]:
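# Find apps that have at least 1000 reviews
A = cleaned[cleaned['Reviews'] >= 1000]
# Average Rating per Category, lowest first (select the column so non-numeric columns are not averaged)
z = A.groupby('Category')['Rating'].mean().sort_values()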



In [127]:
#For each unique Content Rating, find the lowest Rating
cleaned.groupby('Content Rating').agg('min')


Content Rating,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated
Adults only 18+,DraftKings - Daily Fantasy Sports,COMICS,4.5,24005,4900.0,500000,1,0.0,2018
Everyone,"""i DT"" Fútbol. Todos Somos Técnicos.",ART_AND_DESIGN,1.0,0,8.5,0,0,0.0,2010
Everyone 10+,10 Best Foods for You,ART_AND_DESIGN,1.8,0,28.0,10,0,0.0,2011
Mature 17+,- Free Comics - Comic Apps,BEAUTY,1.0,0,157.0,1,0,0.0,2012
Teen,100 Doors of Revenge,ART_AND_DESIGN,2.0,0,323.0,0,0,0.0,2011
Unrated,Best CG Photography,FAMILY,4.1,1,2500.0,500,1,0.0,2012


In [129]:
fp = os.path.join('data', 'googleplaystore.csv')
df = pd.read_csv(fp)
cleaned = lab.clean_apps(df)
info = lab.store_info(cleaned)
len(info)
#    4
info[2] in cleaned.Category.unique()
#    True

True

### Transforming Apps review count by App category

A reasonable question to ask after cleaning the apps dataset is how popular each app is. One way to measure an app's popularity is to study its review count within its category. 

**Question 3**
* Create a function `std_reviews_by_app_cat` that takes in a **cleaned** dataframe and outputs a dataframe with 
    - the same rows as the input,
    - two columns given by `['Category', 'Reviews']`,
    - where the `Reviews` column is *standardized by app category* -- that is, the number of reviews for every app is put into the standard units for the category it belongs to. For a review of standard units, see the [DSC 10 Textbook](https://www.inferentialthinking.com/chapters/15/1/Correlation).
    - *Hint*: use the methods `groupby` and `transform`.
* Lastly, create a function `su_and_spread` that returns a list of two items (hard-code your answers):
    - Consider the following scenario: half of the apps in the category 'FAMILY' receive ratings of 0 stars while the other
    half have ratings of 5 stars. Similarly, the 'MEDICAL' category has half 1-star and half 4-star apps.
    Which app would have a higher rating after standardization: the five-star app in the family category or the four-star app in the
    medical one? Answer with the name of the corresponding category ('FAMILY'/'MEDICAL') or use 'equal' if you think both
    ratings would be the same after standardization. (Don't worry about the uppercase but do be careful with the spelling). 
    - Which category type has the biggest "spread" of review count?
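
As a reminder of how per-group standardization behaves, here is a minimal sketch on made-up data. The helper name `standard_units` and the toy values are hypothetical, and the sketch assumes the population standard deviation (`ddof=0`, the default of `np.std`); note that pandas' own `.std()` defaults to `ddof=1`.

```python
import pandas as pd

toy = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B'],
    'Reviews': [10, 20, 30, 100, 300],
})

def standard_units(values):
    # Assumption: population standard deviation (ddof=0).
    return (values - values.mean()) / values.std(ddof=0)

# transform applies the function within each category and returns a
# result aligned with the original rows.
standardized = toy.assign(
    Reviews=toy.groupby('Category')['Reviews'].transform(standard_units)
)
```

Because `transform` returns one value per input row, the output keeps the same rows and order as the input, and each category's standardized values average to 0.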
    

In [130]:
cleaned.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000.0,10000,1,0.0,Everyone,2018
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000.0,500000,1,0.0,Everyone,2018
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700.0,5000000,1,0.0,Everyone,2018
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000.0,50000000,1,0.0,Teen,2018
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800.0,100000,1,0.0,Everyone,2018


In [152]:
z = cleaned.loc[:,['Category','Reviews']]

Unnamed: 0,Category,Reviews
0,ART_AND_DESIGN,159
1,ART_AND_DESIGN,967
2,ART_AND_DESIGN,87510
3,ART_AND_DESIGN,215644
4,ART_AND_DESIGN,967
...,...,...
9140,BUSINESS,0
9141,FAMILY,38
9142,FAMILY,4
9143,MEDICAL,3


In [150]:
fp = os.path.join('data', 'googleplaystore.csv')
play = pd.read_csv(fp)
clean_play = lab.clean_apps(play)
out = lab.std_reviews_by_app_cat(clean_play)
set(out.columns) == set(['Category', 'Reviews'])
#    True
np.all(abs(out.select_dtypes(include='number').mean()) < 10**-7)  # standard units should average to 0!
#    True

True

In [168]:
out.groupby('Category').max()['Reviews'] - out.groupby('Category').min()['Reviews']

#sort_values('Reviews',ascending = False)

Category
ART_AND_DESIGN          4.407841
AUTO_AND_VEHICLES       5.958826
BEAUTY                  4.721507
BOOKS_AND_REFERENCE     6.507321
BUSINESS               11.520059
COMICS                  6.561508
COMMUNICATION           8.680666
DATING                  5.455673
EDUCATION               3.962012
ENTERTAINMENT           6.030675
EVENTS                  6.286431
FAMILY                 31.709626
FINANCE                 9.553916
FOOD_AND_DRINK          5.880516
GAME                   10.431652
HEALTH_AND_FITNESS      6.418936
HOUSE_AND_HOME          5.568493
LIBRARIES_AND_DEMO      7.460496
LIFESTYLE              16.452048
MAPS_AND_NAVIGATION     9.867696
MEDICAL                10.683903
NEWS_AND_MAGAZINES      6.363415
PARENTING               7.315505
PERSONALIZATION        10.524197
PHOTOGRAPHY             7.876637
PRODUCTIVITY            8.386164
SHOPPING                6.379994
SOCIAL                 11.223361
SPORTS                 13.814516
TOOLS                  13.418616
T

In [None]:
cleaned = lab.clean_apps(play)

### Facebook Friends

**Question 4**

A group of students decided to send out a survey to their Facebook friends. Each student asks 1000 of their friends for their first and last name, the company they currently work at, their job title, their email, and the university they attended. Combine all the data contained in the files `survey*.csv` (within the `responses` folder within the data folder) into a single dataframe. The number of files and the number of rows in each file may vary, so don't hardcode your answers!

Create a function `read_survey` which takes in a directory path (containing files `survey*.csv`), and outputs a dataframe with six columns titled: `first name`, `last name`, `current company`, `job title`, `email`, `university` (in that order). 

*Hint*: You can list the files in a directory using `os.listdir`.

*Remark: You may have to do some cleaning to make this possible!*
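
One possible shape for this kind of file-combining step is sketched below. It is not the graded solution: the function name `read_survey_sketch`, the toy row, and the temporary files are hypothetical, and it assumes the headers differ only in letter case and in using underscores instead of spaces.

```python
import os
import tempfile
import pandas as pd

COLUMNS = ['first name', 'last name', 'current company',
           'job title', 'email', 'university']

def read_survey_sketch(dirname):
    """Combine every survey*.csv in dirname into one dataframe."""
    frames = []
    for fname in sorted(os.listdir(dirname)):
        if not (fname.startswith('survey') and fname.endswith('.csv')):
            continue
        df = pd.read_csv(os.path.join(dirname, fname))
        # Normalize headers such as 'FIRST_NAME' or 'first_Name' to 'first name'.
        df.columns = df.columns.str.lower().str.replace('_', ' ', regex=False)
        frames.append(df[COLUMNS])  # enforce a common column order
    return pd.concat(frames, ignore_index=True)

# Toy demonstration: two made-up files with different header styles.
row = ['Ada', 'Lovelace', 'Example Inc', 'Analyst',
       'ada@example.edu', 'Example University']
with tempfile.TemporaryDirectory() as tmp:
    first = pd.DataFrame([row], columns=COLUMNS)
    first.to_csv(os.path.join(tmp, 'survey1.csv'), index=False)
    second = first[COLUMNS[::-1]].copy()  # reversed column order
    second.columns = [c.upper().replace(' ', '_') for c in second.columns]
    second.to_csv(os.path.join(tmp, 'survey2.csv'), index=False)
    combined = read_survey_sketch(tmp)
```

Looping over `os.listdir` rather than naming files one by one means the number of files does not need to be known in advance.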

Create a function `com_stats` that takes in a dataframe and returns a (hardcoded) list containing: 
- The number of employees at the company that hired the most employees
- The number of emails that end in ".edu"
- The job title that has the longest name (there are no ties)
- The number of managers (hint: you may want to look through all the job titles to make sure you get all of them!)

In [219]:
out.head()

Unnamed: 0,first name,last name,current company,job title,email,university
0,Marsh,Kyllford,"Abernathy, Brown and Stiedemann",Financial Analyst,mkyllford0@naver.com,Ganja State University
1,Igor,Greatex,"Stiedemann, Eichmann and Will",Data Coordiator,igreatex1@phpbb.com,Meikai University
2,Garrick,Truscott,Collins-Hintz,Tax Accountant,gtruscott2@weather.com,Saint Paul University
3,Eleanore,Sansam,Trantow and Sons,Senior Quality Engineer,esansam3@pbs.org,Universidad del Rosario
4,Laurent,Bagley,Weissnat and Sons,Financial Advisor,lbagley4@statcounter.com,St. Petersburg StateMechnikov Medical Academy


In [266]:
out.loc[:,'job title'] = out['job title'].fillna('0')
len(out[out['job title'].str.contains('manager')])
43 + 326

369

In [251]:
(out['first name'] + out['last name']).map(len).max()

26

In [252]:
out[(out['first name'] + out['last name']).map(len) == (out['first name'] + out['last name']).map(len).max()]

Unnamed: 0,first name,last name,current company,job title,email,university
881,Mollee,Skaife d'Ingerthorpe,Wolff LLC,VP Sales,mskaifedingerthorpeoh@tiny.cc,"Islamic Azad University, Shirvan"
406,Celia,de la Valette Parisot,"Lebsack, Kilback and Grady",Legal Assistant,cdelavaletteparisotba@sbwire.com,University of Engineering and Technology Lahore


In [169]:
dirname = os.path.join('data', 'responses')
lab.read_survey(dirname)

Ellipsis

In [170]:
dirnames = os.listdir(dirname)


['survey5.csv', 'survey4.csv', 'survey1.csv', 'survey3.csv', 'survey2.csv']

In [183]:
suv1 = pd.read_csv(dirname + '/survey1.csv')
suv1.head()

Unnamed: 0,first name,last name,job title,email,current company,university
0,Marsh,Kyllford,Financial Analyst,mkyllford0@naver.com,"Abernathy, Brown and Stiedemann",Ganja State University
1,Igor,Greatex,Data Coordiator,igreatex1@phpbb.com,"Stiedemann, Eichmann and Will",Meikai University
2,Garrick,Truscott,Tax Accountant,gtruscott2@weather.com,Collins-Hintz,Saint Paul University
3,Eleanore,Sansam,Senior Quality Engineer,esansam3@pbs.org,Trantow and Sons,Universidad del Rosario
4,Laurent,Bagley,Financial Advisor,lbagley4@statcounter.com,Weissnat and Sons,St. Petersburg StateMechnikov Medical Academy


In [185]:
# Rearrange survey1 column order
key = ['first name', 'last name', 'current company', 'job title', 'email', 'university']
suv1 = suv1[key]

In [186]:
suv2 = pd.read_csv(dirname + '/survey2.csv')
suv2.head()

Unnamed: 0,CURRENT COMPANY,JOB TITLE,FIRST NAME,LAST NAME,EMAIL,UNIVERSITY
0,Harvey Inc,Safety Technician IV,Ardelia,Winspurr,awinspurr0@timesonline.co.uk,Universidad Autónoma de Yucatán
1,Johnston-Hermann,Structural Engineer,Ileane,Balhatchet,ibalhatchet1@fastcompany.com,Technical University of Opole
2,Dibbert-Lemke,Human Resources Assistant III,Damita,Seamer,dseamer2@elegantthemes.com,Osaka City University
3,"Rutherford, Schiller and Skiles",Staff Accountant III,Krystal,Clerc,kclerc3@lulu.com,"DeVry Institute of Technology, Decatur"
4,"Luettgen, Anderson and Green",Automation Specialist III,Kirsti,Raithbie,kraithbie4@liveinternet.ru,University of Maryland Medicine


In [192]:
#lower survey2 column names
suv2 = suv2.rename(str.lower,axis = 1)
suv2 = suv2[key]

In [193]:
suv3 = pd.read_csv(dirname + '/survey3.csv')
suv3.head()

Unnamed: 0,CURRENT_COMPANY,EMAIL,FIRST_NAME,LAST_NAME,JOB_TITLE,UNIVERSITY
0,"Herman, Robel and Krajcik",rmournian0@census.gov,Rock,Mournian,Software Engineer IV,Tokushima University
1,Parisian-Powlowski,hkollatsch1@mail.ru,Helena,Kollatsch,Information Systems Manager,Musashi University
2,Treutel and Sons,aweall2@flavors.me,Amalita,Weall,VP Accounting,University Konstantina Filozov in Nitra
3,D'Amore-Kiehn,egroucutt3@howstuffworks.com,Elvyn,Groucutt,Executive Secretary,Nizhny Novgorod State University
4,Pagac Group,sbiskupek4@go.com,Suzie,Biskupek,Electrical Engineer,University of Essex


In [201]:
#lower survey3 column names and strip _
suv3 = suv3.rename(str.lower,axis = 1)
suv3.columns = suv3.columns.str.replace('_',' ')
suv3 = suv3[key]

In [205]:
suv4 = pd.read_csv(dirname + '/survey4.csv')
suv4.head()

Unnamed: 0,current_Company,email,first_Name,last_Name,job_title,university
0,Brakus-Aufderhar,cplume0@merriam-webster.com,Charyl,Plume,Dental Hygienist,Université Mohammed Ier
1,Stamm and Sons,fgregersen1@baidu.com,Filbert,Gregersen,Tax Accountant,Universitas Sam Ratulangi
2,"Jaskolski, Gulgowski and Corkery",lmeighan2@domainmarket.com,Lesli,Meighan,Graphic Designer,Sadat Institute of Higher Education
3,"Terry, Howell and Nitzsche",etop3@harvard.edu,Emory,Top,Actuary,Instituto Tecnológico de Aeronáutica
4,"Mills, Reichel and Muller",fmackerness4@liveinternet.ru,Felike,Mackerness,Media Manager IV,Ungku Omar Premier Polytechnic


In [208]:
#similar work as survey3 
suv4.columns = suv4.columns.str.replace('_',' ')
suv4 = suv4.rename(str.lower,axis = 1)
suv4 = suv4[key]

Unnamed: 0,first name,last name,current company,job title,email,university
0,Charyl,Plume,Brakus-Aufderhar,Dental Hygienist,cplume0@merriam-webster.com,Université Mohammed Ier
1,Filbert,Gregersen,Stamm and Sons,Tax Accountant,fgregersen1@baidu.com,Universitas Sam Ratulangi
2,Lesli,Meighan,"Jaskolski, Gulgowski and Corkery",Graphic Designer,lmeighan2@domainmarket.com,Sadat Institute of Higher Education
3,Emory,Top,"Terry, Howell and Nitzsche",Actuary,etop3@harvard.edu,Instituto Tecnológico de Aeronáutica
4,Felike,Mackerness,"Mills, Reichel and Muller",Media Manager IV,fmackerness4@liveinternet.ru,Ungku Omar Premier Polytechnic
...,...,...,...,...,...,...
995,Liva,Denson,Funk Group,Nuclear Power Engineer,ldensonrn@economist.com,Wagner College
996,Concettina,Elcoat,"Harber, Ziemann and Upton",Physical Therapy Assistant,celcoatro@about.com,Moscow State University
997,Rozele,Agirre,O'Reilly LLC,Data Coordiator,ragirrerp@spotify.com,University of North Carolina at Charlotte
998,Rose,Saville,Kling Inc,Graphic Designer,rsavillerq@nationalgeographic.com,University of Twente


In [209]:
suv5 = pd.read_csv(dirname + '/survey5.csv')
suv5.head()

Unnamed: 0,email,first_Name,last_Name,job_title,university,current_Company
0,cpippin0@wordpress.org,Consuelo,Pippin,Dental Hygienist,Military University Shoumen,Mayert and Sons
1,agotts1@posterous.com,Amata,Gotts,Civil Engineer,Abo Akademi University,Padberg and Sons
2,gwarmisham2@techcrunch.com,Glori,Warmisham,Senior Developer,St. Vincent College,Bruen-Rosenbaum
3,blytle3@businessweek.com,Byron,Lytle,Developer IV,"University of the West Indies, Mona",Upton Inc
4,cswadlin4@drupal.org,Carolee,Swadlin,Librarian,Novosibirsk State University,Swift-Lemke


In [211]:
#similar work as survey4
suv5.columns = suv5.columns.str.replace('_',' ')
suv5 = suv5.rename(str.lower,axis = 1)
suv5 = suv5[key]
suv5.head()

Unnamed: 0,first name,last name,current company,job title,email,university
0,Consuelo,Pippin,Mayert and Sons,Dental Hygienist,cpippin0@wordpress.org,Military University Shoumen
1,Amata,Gotts,Padberg and Sons,Civil Engineer,agotts1@posterous.com,Abo Akademi University
2,Glori,Warmisham,Bruen-Rosenbaum,Senior Developer,gwarmisham2@techcrunch.com,St. Vincent College
3,Byron,Lytle,Upton Inc,Developer IV,blytle3@businessweek.com,"University of the West Indies, Mona"
4,Carolee,Swadlin,Swift-Lemke,Librarian,cswadlin4@drupal.org,Novosibirsk State University


In [212]:
#combine all df together
pd.concat([suv1,suv2,suv3,suv4,suv5])

Unnamed: 0,first name,last name,current company,job title,email,university
0,Marsh,Kyllford,"Abernathy, Brown and Stiedemann",Financial Analyst,mkyllford0@naver.com,Ganja State University
1,Igor,Greatex,"Stiedemann, Eichmann and Will",Data Coordiator,igreatex1@phpbb.com,Meikai University
2,Garrick,Truscott,Collins-Hintz,Tax Accountant,gtruscott2@weather.com,Saint Paul University
3,Eleanore,Sansam,Trantow and Sons,Senior Quality Engineer,esansam3@pbs.org,Universidad del Rosario
4,Laurent,Bagley,Weissnat and Sons,Financial Advisor,lbagley4@statcounter.com,St. Petersburg StateMechnikov Medical Academy
...,...,...,...,...,...,...
995,Hurleigh,Sphinxe,Kozey-Hamill,Marketing Manager,hsphinxern@eventbrite.com,Southampton Solent University
996,Michele,Finch,Ernser and Sons,Account Coordinator,mfinchro@jigsy.com,Glion Institute of Higher Education
997,Harlie,Heister,Schuppe-McClure,Technical Writer,hheisterrp@last.fm,Tianjin University of Finance & Economics
998,Phillipe,Dibble,"Barrows, Batz and Dickens",Analog Circuit Design manager,pdibblerq@jugem.jp,University of South Bohemia


In [338]:
dirname = os.path.join('data', 'responses')
out = lab.read_survey(dirname)
isinstance(out, pd.DataFrame)
#    True
len(out)
#    5000
# lab.read_survey('nonexistentfile') # doctest: +ELLIPSIS
#    Traceback (most recent call last):
#    ...
#    FileNotFoundError: ... 'nonexistentfile'

5000

### Combining Data
**Question 5**

Every week, a professor sends out an extra credit survey asking for students' favorite things (animals, movies, etc). 
- Each student who has completed at least 75% of the surveys receives 5 points of extra credit.
- If at least 90% of the class answers at least one of the questions (e.g., favorite animal), *everyone* in the class receives 1 point of extra credit. This overall class extra credit only applies once (e.g., if 95% of students answer favorite color and 91% answer favorite animal, the entire class still only receives 1 extra point as a class).

Create a function `combine_surveys` which takes in a directory path (containing files `favorite*.csv`) and combines all of the survey data into one DataFrame, indexed by student ID (a value 1 - 1000).

Create a function `check_credit` which takes in a DataFrame with the combined survey data and outputs a DataFrame of the names of students and how many extra credit points they would receive, indexed by their ID (a value 1 - 1000).
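
The two extra-credit rules above can be sketched on a toy class of four students. Everything in the example is an assumption made for illustration: the toy frames stand in for the `favorite*.csv` files, the output column name `extra` is hypothetical, and "75% of the surveys" is interpreted as "75% of the question columns".

```python
import pandas as pd

# Toy stand-ins for the survey files: one frame per question.
names = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['A', 'B', 'C', 'D']})
movie = pd.DataFrame({'id': [1, 2, 3, 4], 'movie': ['m1', None, 'm3', 'm4']})
animal = pd.DataFrame({'id': [4, 3, 2, 1], 'animal': ['a4', None, None, 'a1']})

# Index each frame by id so that concat aligns rows on id, not on position.
frames = [df.set_index('id') for df in (names, movie, animal)]
combined = pd.concat(frames, axis=1)

questions = combined.columns.drop('name')
answered = combined[questions].notna()

# 5 points for students who answered at least 75% of the questions.
individual = (answered.mean(axis=1) >= 0.75) * 5

# 1 point for everyone if some question was answered by at least 90% of the class.
class_bonus = int((answered.mean(axis=0) >= 0.90).any())

credit = pd.DataFrame({'name': combined['name'],
                       'extra': individual + class_bonus})
```

Working with the boolean `answered` frame keeps both rules as simple means: across columns for each student, and down rows for each question.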

In [384]:
threshold = 4 * 0.25
notAnswer = out.isna().sum(axis = 1)

checked = notAnswer <= threshold

In [382]:
#(notAnswer <= 1).sum()

5

In [385]:
ex = checked.replace({True:5,False:0})

In [391]:
out

id,name_x,name_y,movie,genre,animal,plant
1,Myrtia,Myrtia,,(no genres listed),Long-crested hawk eagle,
2,Nathanil,Nathanil,,Documentary,Euro wallaby,
3,Joni,Joni,"Glass-blower's Children, The (Glasblåsarns barn)",,Brown brocket,
4,Prentice,Prentice,,(no genres listed),"Peccary, white-lipped",
5,Claudette,Claudette,,,"Capuchin, brown",
...,...,...,...,...,...,...
996,,Addie,Kung Phooey!,Horror|Mystery|Sci-Fi,"Eland, common",
997,,Valaria,Angel Heart,,Agouti,
998,,Gunilla,,,"Shelduck, european",
999,,Zitella,,Comedy,,


In [406]:
# Check class extra credit
student_thresh = len(out) * 0.1

overall = (out[['movie','genre','animal','plant']].isna().sum() <= student_thresh).sum() >= 1
if overall:
    ex = ex + 1

In [410]:
dirname = os.path.join('data', 'extra-credit-surveys')
df = lab.combine_surveys(dirname)
out = lab.check_credit(df)
out.shape
#    (1000, 2)
out

id,name_y,extra
1,Myrtia,0
2,Nathanil,0
3,Joni,0
4,Prentice,0
5,Claudette,0
...,...,...
996,Addie,0
997,Valaria,0
998,Gunilla,0
999,Zitella,0


In [407]:
out = out.assign(extra = ex)
out

id,name_x,name_y,movie,genre,animal,plant,extra
1,Myrtia,Myrtia,,(no genres listed),Long-crested hawk eagle,,0
2,Nathanil,Nathanil,,Documentary,Euro wallaby,,0
3,Joni,Joni,"Glass-blower's Children, The (Glasblåsarns barn)",,Brown brocket,,0
4,Prentice,Prentice,,(no genres listed),"Peccary, white-lipped",,0
5,Claudette,Claudette,,,"Capuchin, brown",,0
...,...,...,...,...,...,...,...
996,,Addie,Kung Phooey!,Horror|Mystery|Sci-Fi,"Eland, common",,0
997,,Valaria,Angel Heart,,Agouti,,0
998,,Gunilla,,,"Shelduck, european",,0
999,,Zitella,,Comedy,,,0


In [312]:
# Use the row passed to the lambda (not the whole frame) so the condition is a single boolean;
# a student earns 5 points when at most `threshold` answers are missing.
out.apply(lambda row: 5 if row.isna().sum() <= threshold else 0, axis=1)

In [267]:
dirname = os.path.join('data', 'extra-credit-surveys')

In [272]:
names = os.listdir(dirname)

In [282]:
pd.read_csv(dirname + '/favorite1.csv')

Unnamed: 0,id,name
0,1,Myrtia
1,2,Nathanil
2,3,Joni
3,4,Prentice
4,5,Claudette
...,...,...
995,996,Addie
996,997,Valaria
997,998,Gunilla
998,999,Zitella


In [283]:
pd.read_csv(dirname + '/favorite2.csv')

Unnamed: 0,id,movie
0,1,
1,2,
2,3,"Glass-blower's Children, The (Glasblåsarns barn)"
3,4,
4,5,
...,...,...
995,996,Kung Phooey!
996,997,Angel Heart
997,998,
998,999,
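
The two files above share an `id` column, so they can be combined by indexing on `id` and joining. The sketch below is only one possible shape for such a helper: `combine_on_id` is a hypothetical name, the toy files are written to a temporary directory, and it is not the lab's `combine_surveys`.

```python
import os
import tempfile
from functools import reduce

import pandas as pd

def combine_on_id(dirname):
    """Hypothetical helper: read every CSV in dirname and outer-join on 'id'."""
    frames = [
        pd.read_csv(os.path.join(dirname, fname)).set_index('id')
        for fname in sorted(os.listdir(dirname))  # raises FileNotFoundError if dirname is missing
        if fname.endswith('.csv')
    ]
    # Join the frames pairwise; an outer join keeps ids that appear in any file.
    return reduce(lambda left, right: left.join(right, how='outer'), frames)

with tempfile.TemporaryDirectory() as tmp:
    # Made-up toy files that mimic the id/name and id/movie layout shown above.
    pd.DataFrame({'id': [1, 2], 'name': ['Myrtia', 'Nathanil']}).to_csv(
        os.path.join(tmp, 'favorite1.csv'), index=False)
    pd.DataFrame({'id': [2, 3], 'movie': ['Angel Heart', None]}).to_csv(
        os.path.join(tmp, 'favorite2.csv'), index=False)
    combined = combine_on_id(tmp)
```

`DataFrame.join` needs distinct column names across files; if two files repeat a column, `merge` with suffixes would be needed instead.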


In [358]:
dirname = os.path.join('data', 'extra-credit-surveys')
out = lab.combine_surveys(dirname)
isinstance(out, pd.DataFrame)
#    True
out.shape
#    (1000, 6)
# lab.combine_surveys('nonexistentfile') # doctest: +ELLIPSIS
#    Traceback (most recent call last):
#    ...
#    FileNotFoundError: ... 'nonexistentfile'

(1000, 6)

In [293]:
out

Unnamed: 0_level_0,name_x,name_y,movie,genre,animal,plant
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,Myrtia,Myrtia,,(no genres listed),Long-crested hawk eagle,
2,Nathanil,Nathanil,,Documentary,Euro wallaby,
3,Joni,Joni,"Glass-blower's Children, The (Glasblåsarns barn)",,Brown brocket,
4,Prentice,Prentice,,(no genres listed),"Peccary, white-lipped",
5,Claudette,Claudette,,,"Capuchin, brown",
...,...,...,...,...,...,...
996,,Addie,Kung Phooey!,Horror|Mystery|Sci-Fi,"Eland, common",
997,,Valaria,Angel Heart,,Agouti,
998,,Gunilla,,,"Shelduck, european",
999,,Zitella,,Comedy,,


### Joining pets and owners

**Question 6**

You are analyzing data from a veterinary clinic. The datasets contain several types of information from the clinic, including its customers (pet owners), pets, available procedures, and procedure history. The column names are self-explanatory. These dataframes are provided to you:
-  `owners` stores the customer information, where every `OwnerID` is unique (verify this yourself).
-  `pets` stores the pet information. Each pet belongs to a customer in `owners`.
-  `procedure_detail` contains a catalog of procedures that are offered by the clinic.
-  `procedure_history` has procedure records. Each procedure is given to a pet in `pets`.

You want to answer the following questions:

1. What is the most popular Procedure Type among the pets in our `pets` dataset? Note that some pets are registered but haven't had any procedure performed, and some pets that have had procedures done are not registered in `pets`. Create a function `most_popular_procedure` that takes in `pets` and `procedure_history` and returns the name of the most popular Procedure Type as a string.
 
2. What is the name of each customer's pet(s)? Create a function `pet_name_by_owner` that takes in `owners` and `pets` and returns a Series that holds the pet name (as a string) indexed by the owner's (first) name. If an owner has multiple pets, the corresponding value should be a list of names as strings.

3. For each city with owners whose pets appear in our procedure history, how much does the city spend in total on procedures? Create a function `total_cost_per_city` that takes in `owners`, `pets`, `procedure_history`, and `procedure_detail` and returns a Series containing the total amount of money each city has spent on pets' procedures, indexed by `City`. Hint: think of what makes a procedure unique in the context of this dataset.

In [10]:
owners_fp = os.path.join('data', 'pets', 'Owners.csv')
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
procedure_detail_fp = os.path.join('data', 'pets', 'ProceduresDetails.csv')
procedure_history_fp = os.path.join('data', 'pets', 'ProceduresHistory.csv')

In [12]:
owners = pd.read_csv(owners_fp)
pets = pd.read_csv(pets_fp)
procedure_detail = pd.read_csv(procedure_detail_fp)
procedure_history = pd.read_csv(procedure_history_fp)

AttributeError: 'DataFrame' object has no attribute 'read_csv'
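
An `AttributeError` like the one above is typical when the name `pd` has been rebound to a DataFrame somewhere in the session, so `pd.read_csv` is looked up on the DataFrame rather than on the pandas module. That is the most likely cause here, though the notebook alone does not prove it. A small sketch with made-up data:

```python
import pandas as pd

frame = pd.DataFrame({'ProcedureSubCode': [1, 2], 'Price': [32, 100]})

pandas_module = pd        # keep a handle on the module for this demonstration
pd = frame[['Price']]     # rebinding `pd` hides the pandas module
shadowed = not hasattr(pd, 'read_csv')  # a DataFrame has no read_csv attribute

pd = pandas_module        # restoring the name (or re-running the import) fixes it
detail_prices = frame[['Price']]  # a distinct variable name avoids the clash
```

In a notebook, re-running the `import pandas as pd` cell restores the module; picking another variable name avoids the problem in the first place.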

In [423]:
procedure_history['ProcedureType'].value_counts()

VACCINATIONS         1447
GROOMING              436
GENERAL SURGERIES     271
ORTHOPEDIC             90
OFFICE FEES            30
HOSPITALIZATION        10
Name: ProcedureType, dtype: int64
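
The counts above cover every row of `procedure_history`, including procedures for pets that are not registered in `pets`. A hedged sketch of restricting the count to registered pets with an inner merge, using toy frames whose IDs are invented for illustration:

```python
import pandas as pd

pets_toy = pd.DataFrame({'PetID': ['A1', 'B2', 'C3']})  # C3 has no procedures
history_toy = pd.DataFrame({
    'PetID': ['A1', 'A1', 'B2', 'Z9', 'Z9', 'Z9'],       # Z9 is not registered
    'ProcedureType': ['GROOMING', 'GROOMING', 'VACCINATIONS',
                      'VACCINATIONS', 'VACCINATIONS', 'VACCINATIONS'],
})

# An inner merge keeps only procedures whose PetID appears in pets_toy.
registered = pets_toy.merge(history_toy, on='PetID', how='inner')
most_popular = registered['ProcedureType'].value_counts().idxmax()

# Counting the raw history instead would let the unregistered pet decide.
unfiltered = history_toy['ProcedureType'].value_counts().idxmax()
```

In this toy data the two answers differ, which is the case the question's note warns about.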

In [603]:
owners

Unnamed: 0,OwnerID,Name,Surname,StreetAddress,City,State,StateFull,ZipCode
0,6049,Debbie,Metivier,315 Goff Avenue,Grand Rapids,MI,Michigan,49503
1,2863,John,Sebastian,3221 Perry Street,Davison,MI,Michigan,48423
2,3518,Connie,Pauley,1539 Cunningham Court,Bloomfield Township,MI,Michigan,48302
3,3663,Lena,Haliburton,4217 Twin Oaks Drive,Traverse City,MI,Michigan,49684
4,1070,Jessica,Velazquez,3861 Woodbridge Lane,Southfield,MI,Michigan,48034
...,...,...,...,...,...,...,...,...
84,2103,Robert,Adkins,2102 Perry Street,Flint,MI,Michigan,48548
85,4464,Daniel,Nielson,4876 Tully Street,Detroit,MI,Michigan,48219
86,5737,Alden,McMiller,3111 Tennessee Avenue,Pontiac,MI,Michigan,48342
87,9850,Gary,Snider,3139 Nash Street,Detroit,MI,Michigan,48227


In [604]:
pets

Unnamed: 0,PetID,Name,Kind,Gender,Age,OwnerID
0,J6-8562,Blackie,Dog,male,11,5168
1,Q0-2001,Roomba,Cat,male,9,5508
2,M0-2904,Simba,Cat,male,1,3086
3,R3-7551,Keller,Parrot,female,2,7908
4,P2-7342,Cuddles,Dog,male,13,4378
...,...,...,...,...,...,...
95,U8-6473,Biscuit,Dog,female,3,1070
96,I5-4893,Cookie,Cat,female,3,7340
97,Q8-0954,Lakshmi,Cat,female,7,9385
98,N0-9539,Swiffer,Cat,male,14,9365


In [609]:
owners_ss = owners[['OwnerID','Name']]
owners_ss = owners_ss.rename({'Name':'Owner'},axis = 1)
pets_ss = pets[['OwnerID','Name']]
merged = pd.merge(owners_ss,pets_ss,on = 'OwnerID',how = 'outer')
merged





Unnamed: 0,OwnerID,Owner,Name
0,6049,Debbie,Biscuit
1,2863,John,Biscuit
2,3518,Connie,Biscuit
3,3663,Lena,Biscuit
4,1070,Jessica,Biscuit
...,...,...,...
95,4464,Daniel,Scooter
96,5737,Alden,Scooter
97,9850,Gary,Scooter
98,9850,Gary,Daisy


In [613]:
# get duplicated owner IDs
more = merged[merged['OwnerID'].duplicated()]
ids = more['OwnerID'].value_counts().index.tolist()

# drop duplicates
merged = merged.drop_duplicates(subset = ['OwnerID'])

for i in ids:
    hold = []
    # find first occurrence
    hold.append(merged[merged['OwnerID'] == i]['Name'].iloc[0])
    # find other names
    # print(more[more['OwnerID'] == i ]['Name'].values.tolist())
    hold.extend(more[more['OwnerID'] == i ]['Name'].values.tolist())
    
    # convert to string
    hold = [', '.join(x for x in hold)][0]
    
    # find idx
    idx = merged[merged['OwnerID'] == i].index[0]
    # change it (note: this chained assignment writes to a temporary copy,
    # so `merged` itself is left unchanged)
    merged[merged['OwnerID'] == i].loc[idx,'Name']= hold

merged.set_index('Owner')['Name']

Owner
Debbie     Biscuit
John       Biscuit
Connie     Biscuit
Lena       Biscuit
Jessica    Biscuit
            ...   
Robert         Taz
Daniel     Scooter
Alden      Scooter
Gary       Scooter
Joseph     Blackie
Name: Name, Length: 89, dtype: object
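
In the output above, Gary still shows only `Scooter` although the merged frame lists both `Scooter` and `Daisy` for `OwnerID` 9850. One alternative to the loop is to group the merged rows and collect names per owner. This is a minimal sketch on toy data, not the lab's reference solution; grouping on `OwnerID` as well as the first name is an assumption made to keep owners who share a first name apart.

```python
import pandas as pd

# Toy frames mirroring the OwnerID/Name columns used above.
owners_toy = pd.DataFrame({'OwnerID': [1, 2], 'Name': ['Gary', 'Lena']})
pets_toy = pd.DataFrame({'OwnerID': [1, 1, 2],
                         'Name': ['Scooter', 'Daisy', 'Biscuit']})

merged_toy = owners_toy.rename(columns={'Name': 'Owner'}).merge(pets_toy, on='OwnerID')

# Collect each owner's pet names into a list, one entry per owner.
names = merged_toy.groupby(['OwnerID', 'Owner'], sort=False)['Name'].agg(list)

# Unwrap single-pet owners to a plain string; keep lists for multiple pets.
by_owner = (
    names.map(lambda pet_names: pet_names[0] if len(pet_names) == 1 else pet_names)
    .reset_index(level='OwnerID', drop=True)
)
```

Here `by_owner['Gary']` is a list of two names and `by_owner['Lena']` is a single string, matching the format the question asks for.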

In [551]:
idx = merged[merged['OwnerID'] == 1546].index[0]
#merged[merged['OwnerID'] == 1546].iat[0,2] ='x'
merged[merged['OwnerID'] == 1546].loc[idx,'Name']= 'X'

#merged[merged['OwnerID'] == 1546]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
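
The warning above is raised because `merged[merged['OwnerID'] == 1546]` builds a new object, and the following `.loc[...] = 'X'` assigns into that temporary object rather than into `merged`. A sketch of the single-step form the warning recommends, on a toy frame with invented values:

```python
import pandas as pd

merged_toy = pd.DataFrame({'OwnerID': [1546, 9850], 'Name': ['Taz', 'Scooter']})

# Chained form (filter first, then assign) targets a temporary copy:
# merged_toy[merged_toy['OwnerID'] == 1546].loc[idx, 'Name'] = 'X'

# Single .loc call on the original frame: row mask and column label together.
merged_toy.loc[merged_toy['OwnerID'] == 1546, 'Name'] = 'X'
```

Because the row selection and the column label go through one `.loc` call, the assignment reaches the original frame.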


In [593]:
merged[merged['OwnerID'] == 9850]['Name'].iloc[0]


'Scooter'

In [568]:
z = ['a','b','c']
[', '.join(x for x in z)]

['a, b, c']

In [430]:
pets['OwnerID'].value_counts()

5508    3
8133    3
3089    3
7484    2
7846    2
       ..
9427    1
5207    1
4185    1
3034    1
3663    1
Name: OwnerID, Length: 89, dtype: int64

In [424]:
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
procedure_history_fp = os.path.join('data', 'pets', 'ProceduresHistory.csv')
pets = pd.read_csv(pets_fp)
procedure_history = pd.read_csv(procedure_history_fp)
out = lab.most_popular_procedure(pets, procedure_history)
isinstance(out,str)
#    True

True

In [616]:
owners_fp = os.path.join('data', 'pets', 'Owners.csv')
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
owners = pd.read_csv(owners_fp)
pets = pd.read_csv(pets_fp)
out = lab.pet_name_by_owner(owners, pets)
len(out) == len(owners)
#    True
'Sarah' in out.index
#    True
'Cookie' in out.values

True

In [None]:
procedure_detail

In [None]:
lab.pet_name_by_owner(owners, pets)

In [672]:
procedure_history.head()

Unnamed: 0,PetID,Date,ProcedureType,ProcedureSubCode
0,A8-1181,2016-01-10,VACCINATIONS,5
1,E7-3766,2016-01-11,VACCINATIONS,5
2,B8-8740,2016-01-11,VACCINATIONS,5
3,D4-9443,2016-01-11,VACCINATIONS,5
4,F6-3398,2016-01-12,HOSPITALIZATION,1


In [673]:
owners.head()

Unnamed: 0,OwnerID,Name,Surname,StreetAddress,City,State,StateFull,ZipCode
0,6049,Debbie,Metivier,315 Goff Avenue,Grand Rapids,MI,Michigan,49503
1,2863,John,Sebastian,3221 Perry Street,Davison,MI,Michigan,48423
2,3518,Connie,Pauley,1539 Cunningham Court,Bloomfield Township,MI,Michigan,48302
3,3663,Lena,Haliburton,4217 Twin Oaks Drive,Traverse City,MI,Michigan,49684
4,1070,Jessica,Velazquez,3861 Woodbridge Lane,Southfield,MI,Michigan,48034


In [674]:
pets.head()

Unnamed: 0,PetID,Name,Kind,Gender,Age,OwnerID
0,J6-8562,Blackie,Dog,male,11,5168
1,Q0-2001,Roomba,Cat,male,9,5508
2,M0-2904,Simba,Cat,male,1,3086
3,R3-7551,Keller,Parrot,female,2,7908
4,P2-7342,Cuddles,Dog,male,13,4378


In [675]:
procedure_detail.head()

Unnamed: 0,ProcedureType,ProcedureSubCode,Description,Price
0,OFFICE FEES,1,Office Call,32
1,OFFICE FEES,2,Emergency,100
2,OFFICE FEES,3,Reck,24
3,GROOMING,1,Bath,15
4,GROOMING,2,Flea Dip,15


In [16]:
# keep only OwnerID and City
own = owners[['OwnerID','City']]
# keep only OwnerID and PetID
pet_s = pets[['OwnerID','PetID']]
# combine owners and pets
pet_own = own.merge(pet_s,on='OwnerID')


In [17]:
pet_own.head()


Unnamed: 0,OwnerID,City,PetID
0,6049,Grand Rapids,I6-9459
1,2863,Davison,R4-6131
2,3518,Bloomfield Township,N6-7350
3,3663,Traverse City,U4-6674
4,1070,Southfield,U8-6473


In [21]:

# keep only PetID and procedure sub-code
pc = procedure_history[['PetID','ProcedureSubCode']]
# keep only sub-code and price
# (named `detail` rather than `pd`, which would shadow the pandas module)
detail = procedure_detail[['ProcedureSubCode','Price']]

# combine all
fin = pc.merge(pet_own,on = 'PetID')
fini = detail.merge(fin, on = 'ProcedureSubCode')

#only care price and city
subset = fini[['Price','City']]
subset.groupby('City').sum()

Unnamed: 0_level_0,Price
City,Unnamed: 1_level_1
Ann Arbor,951
Center Line,665
Commerce,665
Detroit,665
East Lansing,415
Farmington Hills,236
Flint,637
Grand Rapids,4233
Kalamazoo,236
Lansing,1995
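
The totals above come from a merge on `ProcedureSubCode` alone. In `procedure_detail` the same sub-code appears under several procedure types (for example, `OFFICE FEES` 1 and `GROOMING` 1), so a sub-code by itself does not identify a procedure, which is probably what the question's hint is pointing at. The sketch below compares the two merges on toy data; the history frame already carries a `City` column purely to keep the example short.

```python
import pandas as pd

# Two catalog entries taken from the procedure_detail preview above.
detail_toy = pd.DataFrame({
    'ProcedureType': ['OFFICE FEES', 'GROOMING'],
    'ProcedureSubCode': [1, 1],
    'Price': [32, 15],
})
# Made-up history rows, pre-joined with a city for brevity.
history_toy = pd.DataFrame({
    'PetID': ['A1', 'B2'],
    'City': ['Flint', 'Detroit'],
    'ProcedureType': ['GROOMING', 'OFFICE FEES'],
    'ProcedureSubCode': [1, 1],
})

# Sub-code alone: every history row pairs with every type sharing that code.
by_code = history_toy.merge(detail_toy[['ProcedureSubCode', 'Price']],
                            on='ProcedureSubCode')

# Type and sub-code together: each history row matches one catalog entry.
by_both = history_toy.merge(detail_toy, on=['ProcedureType', 'ProcedureSubCode'])
per_city = by_both.groupby('City')['Price'].sum()
```

With the single-key merge the two history rows expand to four priced rows, which would inflate any per-city sum; with both keys each row is priced once.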


In [26]:
owners_fp = os.path.join('data', 'pets', 'Owners.csv')
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
procedure_detail_fp = os.path.join('data', 'pets', 'ProceduresDetails.csv')
procedure_history_fp = os.path.join('data', 'pets', 'ProceduresHistory.csv')
owners = pd.read_csv(owners_fp)
pets = pd.read_csv(pets_fp)
procedure_detail = pd.read_csv(procedure_detail_fp)
procedure_history = pd.read_csv(procedure_history_fp)
out = lab.total_cost_per_city(owners, pets, procedure_history, procedure_detail)
set(out.index) <= set(owners['City'])
#    True



True

## Congratulations! You're done!

* Submit the lab on Gradescope

In [23]:
import os
import pandas as pd
import numpy as np