Several of the following exercises require loading datasets with the pydataset library. (If the import below raises an error, use pip to install the pydataset package.)

from pydataset import data

When the instructions say to load a dataset, pass the name of the dataset as a string to the data function. You can also view the documentation for the dataset by passing the show_doc keyword argument.

data('mpg', show_doc=True) # view the documentation for the dataset
mpg = data('mpg') # load the dataset and store it in a variable

All the datasets loaded from the pydataset library will be pandas dataframes.


In [1]:
import pandas as pd
import numpy as np

In [2]:
from pydataset import data

1. Copy the code from the lesson to create a dataframe full of student grades.

In [3]:
np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})
df


Unnamed: 0,name,math,english,reading
0,Sally,62,85,80
1,Jane,88,79,67
2,Suzie,94,74,95
3,Billy,98,96,88
4,Ada,77,92,98
5,John,79,76,93
6,Thomas,82,64,81
7,Marie,93,63,90
8,Albert,92,62,87
9,Richard,69,80,94


a. Create a column named passing_english that indicates whether each student has a passing grade in english.

In [4]:
df["passing_english"] = df.english >= 70

In [5]:
df

Unnamed: 0,name,math,english,reading,passing_english
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
9,Richard,69,80,94,True
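
Comparing a column to a number returns a boolean Series, which is why the result can be assigned directly as a new column. A minimal sketch of the same idea on a made-up three-row frame (the names and scores are illustrative only, and the 70-point cutoff is the one assumed in the cell above):

```python
import pandas as pd

# hypothetical mini gradebook, not the lesson's randomly generated data
grades = pd.DataFrame({'name': ['A', 'B', 'C'],
                       'english': [69, 70, 95]})

# comparing a column to a scalar yields one True/False value per row
grades['passing_english'] = grades.english >= 70

print(grades.passing_english.tolist())
```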


b. Sort the english grades by the passing_english column. How are duplicates handled?

In [6]:
df.sort_values(by="passing_english") # ties appear in index order here, but the default quicksort does not guarantee that

Unnamed: 0,name,math,english,reading,passing_english
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
11,Alan,92,62,72,False
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
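
If the order of tied rows matters, one option is to request a stable sorting algorithm explicitly rather than rely on the default. The sketch below uses a small made-up frame and kind='mergesort', which keeps tied rows in their original relative order:

```python
import pandas as pd

# hypothetical data with two ties in the sort column
ties = pd.DataFrame({'name': ['w', 'x', 'y', 'z'],
                     'passing': [True, False, True, False]})

# mergesort is stable: rows that compare equal keep the
# relative order they had before sorting
stable = ties.sort_values(by='passing', kind='mergesort')

print(stable.name.tolist())
```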


    c. Sort the english grades first by passing_english and then by student name. All the students that are failing english should be first, and within the students that are failing english they should be ordered alphabetically. The same should be true for the students passing english. (Hint: you can pass a list to the .sort_values method)

In [7]:
df.sort_values(by=["passing_english", "name"])

Unnamed: 0,name,math,english,reading,passing_english
11,Alan,92,62,72,False
8,Albert,92,62,87,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
4,Ada,77,92,98,True
3,Billy,98,96,88,True
10,Isaac,92,99,93,True
1,Jane,88,79,67,True
5,John,79,76,93,True
9,Richard,69,80,94,True


    d. Sort the english grades first by passing_english, and then by the actual english grade, similar to how we did in the last step.

In [8]:
df.sort_values(by=["passing_english", "english"])

Unnamed: 0,name,math,english,reading,passing_english
8,Albert,92,62,87,False
11,Alan,92,62,72,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
2,Suzie,94,74,95,True
5,John,79,76,93,True
1,Jane,88,79,67,True
9,Richard,69,80,94,True
0,Sally,62,85,80,True
4,Ada,77,92,98,True
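
When sorting by several columns, the ascending argument also accepts a list, so each column can have its own direction. A short sketch on made-up scores (the frame and its values are assumptions for illustration) that keeps failing students first but lists the higher grade first within each group:

```python
import pandas as pd

# hypothetical scores; 70 is assumed to be the passing cutoff
grades = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                       'english': [62, 64, 74, 96]})
grades['passing_english'] = grades.english >= 70

# one direction per sort column: passing_english ascending
# (False before True), english descending
ordered = grades.sort_values(by=['passing_english', 'english'],
                             ascending=[True, False])

print(ordered.name.tolist())
```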


    e. Calculate each student's overall grade and add it as a column on the dataframe. The overall grade is the average of the math, english, and reading grades.

In [9]:
df["overall_grade"] = (df.math + df.english + df.reading) / 3

In [10]:
df

Unnamed: 0,name,math,english,reading,passing_english,overall_grade
0,Sally,62,85,80,True,75.666667
1,Jane,88,79,67,True,78.0
2,Suzie,94,74,95,True,87.666667
3,Billy,98,96,88,True,94.0
4,Ada,77,92,98,True,89.0
5,John,79,76,93,True,82.666667
6,Thomas,82,64,81,False,75.666667
7,Marie,93,63,90,False,82.0
8,Albert,92,62,87,False,80.333333
9,Richard,69,80,94,True,81.0
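
An equivalent way to get the row-wise average, sketched here as an alternative rather than the lesson's method, is to select the three subject columns and call .mean(axis=1). The two rows below reuse Sally's and Jane's scores from the table above:

```python
import pandas as pd

# the first two rows of the grades table above
grades = pd.DataFrame({'math': [62, 88],
                       'english': [85, 79],
                       'reading': [80, 67]})

# axis=1 averages across the columns of each row
overall = grades[['math', 'english', 'reading']].mean(axis=1)

print(overall)
```

This form scales more easily if further subjects are added, since only the column list changes.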


2. Load the mpg dataset. Read the documentation for the dataset and use it for the following questions:

In [11]:
mpg = data('mpg')

In [12]:
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


    a. How many rows and columns are there?

In [13]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
manufacturer    234 non-null object
model           234 non-null object
displ           234 non-null float64
year            234 non-null int64
cyl             234 non-null int64
trans           234 non-null object
drv             234 non-null object
cty             234 non-null int64
hwy             234 non-null int64
fl              234 non-null object
class           234 non-null object
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB
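
The row and column counts can also be read directly from the .shape attribute, which is a (rows, columns) tuple. A minimal sketch on a made-up frame:

```python
import pandas as pd

# hypothetical three-row, two-column frame
small = pd.DataFrame({'displ': [1.8, 2.0, 2.8],
                      'cyl': [4, 4, 6]})

# .shape is an attribute, not a method, so there are no parentheses
n_rows, n_cols = small.shape

print(n_rows, n_cols)
```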


    b. What are the data types of each column?

In [19]:
mpg.dtypes

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

    c. Summarize the dataframe with .info and .describe

In [23]:
mpg.describe() # .describe is a method; without the parentheses the cell only echoes the bound method

    d. Rename the cty column to city.

In [26]:
mpg = mpg.rename(columns={"cty":"city"})

    e. Rename the hwy column to highway.

In [27]:
mpg = mpg.rename(columns={"hwy":"highway"})
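
Because the columns argument is a dictionary, both renames could also be done in a single call. A sketch on a made-up two-column frame:

```python
import pandas as pd

# hypothetical frame using the original short column names
cars = pd.DataFrame({'cty': [18, 21], 'hwy': [29, 29]})

# each key is an existing column name, each value its replacement
cars = cars.rename(columns={'cty': 'city', 'hwy': 'highway'})

print(list(cars.columns))
```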

    f. Do any cars have better city mileage than highway mileage?

In [39]:
city_greater_than_highway = mpg.city > mpg.highway
city_greater_than_highway.sum() # count the True values in the mask; 0 means no such cars

0
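
Since a boolean mask is itself a Series, a yes/no question like this one can be answered from the mask alone with .any(), and counted with .sum(). A sketch on made-up mileage figures (the third row is deliberately invented so that one match exists):

```python
import pandas as pd

# hypothetical mileage data; the last car is better in the city
cars = pd.DataFrame({'city': [18, 21, 30],
                     'highway': [29, 29, 28]})

mask = cars.city > cars.highway

has_any = bool(mask.any())   # is at least one value True?
how_many = int(mask.sum())   # True counts as 1, False as 0

print(has_any, how_many)
```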

    g. Create a column named mileage_difference. This column should contain the difference between highway and city mileage for each car.

In [42]:
mpg["mileage_difference"] = mpg.highway - mpg.city

In [43]:
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10
...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8


    h. Which car (or cars) has the highest mileage difference?

In [45]:
highest_mpg_difference = mpg.mileage_difference.max()

In [47]:
mpg[mpg.mileage_difference == highest_mpg_difference]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
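
Filtering on equality with the maximum returns every tied row, which matters here because two cars share the top value. By contrast, .idxmax() returns only the label of the first maximum. A sketch of the difference on made-up data:

```python
import pandas as pd

# hypothetical frame with a tie for the largest difference
cars = pd.DataFrame({'model': ['a', 'b', 'c'],
                     'mileage_difference': [12, 8, 12]})

# idxmax gives the index label of the first maximum only
first_only = cars.mileage_difference.idxmax()

# the boolean mask keeps every row that equals the maximum
all_ties = cars[cars.mileage_difference == cars.mileage_difference.max()]

print(first_only, all_ties.model.tolist())
```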


    i. Which compact class car has the lowest highway mileage? The best?

In [51]:
mpg = mpg.rename(columns={"class":"c_class"}) # class is a reserved word in Python, so mpg.class is a syntax error

In [69]:
list_of_compact_cars = mpg[mpg.c_class == "compact"]

In [72]:
list_of_compact_cars[list_of_compact_cars.highway == list_of_compact_cars.highway.min()]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,c_class,mileage_difference
220,volkswagen,jetta,2.8,1999,6,auto(l4),f,16,23,r,compact,7


In [73]:
list_of_compact_cars[list_of_compact_cars.highway == list_of_compact_cars.highway.max()]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,c_class,mileage_difference
213,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,11
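
Renaming is one way around the reserved word class. Another, sketched below on a small illustrative frame, is to leave the column name alone and use bracket indexing, which accepts any column name as a string:

```python
import pandas as pd

# small illustrative frame that keeps the original column name
cars = pd.DataFrame({'model': ['a4', 'civic', 'jetta'],
                     'class': ['compact', 'subcompact', 'compact'],
                     'highway': [29, 36, 44]})

# cars.class would not parse, but cars['class'] works
compact = cars[cars['class'] == 'compact']

print(compact.model.tolist())
```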


    j. Create a column named average_mileage that is the mean of the city and highway mileage.

In [74]:
mpg["average_mileage"] = (mpg.city + mpg.highway) / 2

In [75]:
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,c_class,mileage_difference,average_mileage
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9,23.5
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8,25.0
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10,21.0
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8,22.0


    k. Which dodge car has the best average mileage? The worst?

In [84]:
dodge_cars = mpg[mpg.manufacturer == "dodge"]

In [85]:
dodge_cars[dodge_cars.average_mileage == dodge_cars.average_mileage.max()]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,c_class,mileage_difference,average_mileage
38,dodge,caravan 2wd,2.4,1999,4,auto(l3),f,18,24,r,minivan,6,21.0


In [86]:
dodge_cars[dodge_cars.average_mileage == dodge_cars.average_mileage.min()]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,c_class,mileage_difference,average_mileage
55,dodge,dakota pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
60,dodge,durango 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv,3,10.5
66,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
70,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup,3,10.5


3. Load the Mammals dataset. Read the documentation for it, and use the data to answer these questions:

In [87]:
mammals = data('Mammals')

In [88]:
mammals

Unnamed: 0,weight,speed,hoppers,specials
1,6000.0,35.0,False,False
2,4000.0,26.0,False,False
3,3000.0,25.0,False,False
4,1400.0,45.0,False,False
5,400.0,70.0,False,False
6,350.0,70.0,False,False
7,300.0,64.0,False,False
8,260.0,70.0,False,False
9,250.0,40.0,False,False
10,3800.0,25.0,False,True


    a. How many rows and columns are there?

In [94]:
mammals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 1 to 107
Data columns (total 4 columns):
weight      107 non-null float64
speed       107 non-null float64
hoppers     107 non-null bool
specials    107 non-null bool
dtypes: bool(2), float64(2)
memory usage: 2.7 KB


    b. What are the data types?

In [92]:
mammals.dtypes

weight      float64
speed       float64
hoppers        bool
specials       bool
dtype: object

    c. Summarize the dataframe with .info and .describe

In [95]:
mammals.describe()

Unnamed: 0,weight,speed
count,107.0,107.0
mean,278.688178,46.208411
std,839.608269,26.716778
min,0.016,1.6
25%,1.7,22.5
50%,34.0,48.0
75%,142.5,65.0
max,6000.0,110.0


    d. What is the weight of the fastest animal?

In [98]:
mammals[mammals.speed == mammals.speed.max()]

Unnamed: 0,weight,speed,hoppers,specials
53,55.0,110.0,False,False
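
The cell above returns the whole row. To pull out only the weight as a single value, one possible approach is to combine .idxmax() with .loc. A sketch on a made-up three-row frame:

```python
import pandas as pd

# hypothetical animals; the second row is the fastest
animals = pd.DataFrame({'weight': [6000.0, 55.0, 0.016],
                        'speed': [35.0, 110.0, 1.6]})

# idxmax finds the row label of the top speed,
# and .loc looks up the weight stored at that label
fastest_weight = animals.loc[animals.speed.idxmax(), 'weight']

print(fastest_weight)
```

Note that this returns only the first animal if several are tied for the top speed.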


    e. What is the overall percentage of specials?

In [109]:
(len(mammals[mammals.specials == True]) / len(mammals)) * 100

9.345794392523365
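
Because True counts as 1 and False as 0, the mean of a boolean column is the fraction of True values. That gives a shorter route to the same kind of percentage, sketched here on made-up data:

```python
import pandas as pd

# hypothetical column with one True value out of four
animals = pd.DataFrame({'specials': [True, False, False, False]})

# mean of booleans = share of True values; scale to a percentage
pct = animals.specials.mean() * 100

print(pct)
```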

    f. How many animals are hoppers that are above the median speed? What percentage is this?

In [112]:
hoppers = mammals[mammals.hoppers == True]

In [113]:
median_speed = mammals.speed.median()

In [121]:
len(hoppers[hoppers.speed > median_speed])

7

In [122]:
len(hoppers[hoppers.speed > median_speed]) / len(mammals) * 100

6.5420560747663545
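
The two filtering steps can also be combined into a single boolean mask with the & operator. The sketch below uses made-up speeds and hopper flags, so its numbers are not those of the Mammals dataset; note that each comparison needs its own parentheses:

```python
import pandas as pd

# hypothetical animals, chosen so the median speed is 30.0
animals = pd.DataFrame({'speed': [10.0, 20.0, 30.0, 40.0, 50.0],
                        'hoppers': [False, True, False, True, True]})

median_speed = animals.speed.median()

# & combines the masks element-wise; parentheses are required
# around the comparison because & binds more tightly than >
fast_hoppers = animals[animals.hoppers & (animals.speed > median_speed)]

count = len(fast_hoppers)
pct = count / len(animals) * 100

print(count, pct)
```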