# Cleaning: Cycle Share

There are 3 datasets that provide data on the stations, trips, and weather from 2014-2016.

**Station dataset**

* station_id: station ID number
* name: name of station
* lat: station latitude
* long: station longitude
* install_date: date that station was placed in service
* install_dockcount: number of docks at each station on the installation date
* modification_date: date that station was modified, resulting in a change in location or dock count
* current_dockcount: number of docks at each station on 8/31/2016
* decommission_date: date that station was placed out of service

**Trip dataset**

* trip_id: numeric ID of bike trip taken
* starttime: day and time trip started, in PST
* stoptime: day and time trip ended, in PST
* bikeid: ID attached to each bike
* tripduration: time of trip in seconds
* from_station_name: name of station where trip originated
* to_station_name: name of station where trip terminated
* from_station_id: ID of station where trip originated
* to_station_id: ID of station where trip terminated
* usertype: "Short-Term Pass Holder" is a rider who purchased a 24-Hour or 3-Day Pass; "Member" is a rider who purchased a Monthly or an Annual Membership
* gender: gender of rider
* birthyear: birth year of rider

**Weather dataset** contains daily weather information in the service area

In [1]:
# Import standard data science modules
import pandas as pd
from pandas import DataFrame as DF
from pandas import Series
import numpy as np
import math

# Import listdir module for finding files to load
from os import listdir

## 1. Import all sets into a dictionary and correct any errors

In [2]:
# Read all data files from cycle_share folder
files = listdir('cycle_share')

# Iterate through file names, load data to a dictionary
data = {}
for f in files:
    k = f.split('.')[0]  # remove .csv
    path = 'cycle_share/' + f
    try:
        data[k] = pd.read_csv(path)
    except Exception as e:
        print('File : {}\n'.format(f))
        print(e)

File : trip.csv

Error tokenizing data. C error: Expected 12 fields in line 50794, saw 20



In [3]:
# Error in line 50794 (index 50793): compare to another line to view differences
with open('cycle_share/trip.csv') as f:
    lines = f.readlines()
for l in lines[50793:50795]:
    print(l, '\n')

59000,"4/17/2015 14:21","4/17/2015 19:21","SEA00362",17990.668,"6th Ave S & S King St","Westlake Ave & 6th Ave","ID-04","SLU-15"trip_id","starttime","stoptime","bikeid","tripduration","from_station_name","to_station_name","from_station_id","to_station_id","usertype","gender","birthyear"
 

431,"10/13/2014 10:31","10/13/2014 10:48","SEA00298",985.935,"2nd Ave & Spring St","Occidental Park / Occidental Ave S & S Washington St","CBD-06","PS-04","Member","Male",1960
 



In [4]:
# How many columns are detected in line 50793 vs 50794?
len(lines[50793].split(',')), len(lines[50794].split(','))

(20, 12)

It looks like column names are attached to the end of the offending row. Examine the first line in the file to see if column names are also at the top:

In [5]:
# Print tokens for column names
lines[0].split(',')

['"trip_id"',
 '"starttime"',
 '"stoptime"',
 '"bikeid"',
 '"tripduration"',
 '"from_station_name"',
 '"to_station_name"',
 '"from_station_id"',
 '"to_station_id"',
 '"usertype"',
 '"gender"',
 '"birthyear"\n']

**So the column names also ended up fused onto the end of a data row in the middle of the file.**

Let's correct the offending line and save the result to a new text file.

In [6]:
# Identify bad line and tokenize
bad_line = lines[50793]
bad_tokens = bad_line.split(',')
bad_tokens

['59000',
 '"4/17/2015 14:21"',
 '"4/17/2015 19:21"',
 '"SEA00362"',
 '17990.668',
 '"6th Ave S & S King St"',
 '"Westlake Ave & 6th Ave"',
 '"ID-04"',
 '"SLU-15"trip_id"',
 '"starttime"',
 '"stoptime"',
 '"bikeid"',
 '"tripduration"',
 '"from_station_name"',
 '"to_station_name"',
 '"from_station_id"',
 '"to_station_id"',
 '"usertype"',
 '"gender"',
 '"birthyear"\n']

In `bad_tokens`, nothing is valid beyond "SLU-15" (the `to_station_id`). Solution: treat the remaining three fields (`usertype`, `gender`, `birthyear`) as null by leaving nothing between the commas.

In [7]:
new_tokens = bad_tokens[:9]
new_tokens

['59000',
 '"4/17/2015 14:21"',
 '"4/17/2015 19:21"',
 '"SEA00362"',
 '17990.668',
 '"6th Ave S & S King St"',
 '"Westlake Ave & 6th Ave"',
 '"ID-04"',
 '"SLU-15"trip_id"']

Strip the `trip_id` text fused onto the last element of `new_tokens`:

In [8]:
# Correct the "trip_id" at the end
end = new_tokens[-1]
end = '"' + end.split('"')[1] + '"'
end

'"SLU-15"'

In [9]:
new_tokens[-1] = end
new_tokens.append(2*',' + '\n')
new_line = ','.join(new_tokens)
new_line

'59000,"4/17/2015 14:21","4/17/2015 19:21","SEA00362",17990.668,"6th Ave S & S King St","Westlake Ave & 6th Ave","ID-04","SLU-15",,,\n'

In [10]:
# check length (should be 12)
len(new_line.split(','))

12

In [11]:
# replace bad line with new one
lines[50793] = new_line

In [12]:
# write lines to new file to correct error
with open('cycle_share/trip_fixed.csv','w') as f:
    for l in lines:
        f.write(l)
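
The manual steps above can be folded into one helper. The following is a minimal sketch rather than the notebook's own code: `repair_fused_line` is a hypothetical name, and it assumes the fields that are kept contain no commas inside quotes (true for this row, not for CSV data in general).

```python
def repair_fused_line(line, header_line, n_fields=12):
    """Cut a fused copy of the header off a data row and pad with empty fields."""
    # Header text without its leading quote: trip_id","starttime",...,"birthyear"
    fused = header_line.strip().lstrip('"')
    cut = line.find(fused)
    if cut == -1:
        return line  # no header fused onto this row; leave it untouched
    kept = line[:cut]  # ends with the last valid field, e.g. "SLU-15"
    n_kept = len(kept.split(','))  # assumes no commas inside the kept fields
    # One comma per missing field leaves those fields empty (null)
    return kept + ',' * (n_fields - n_kept) + '\n'


header = ('"trip_id","starttime","stoptime","bikeid","tripduration",'
          '"from_station_name","to_station_name","from_station_id",'
          '"to_station_id","usertype","gender","birthyear"\n')
bad = ('59000,"4/17/2015 14:21","4/17/2015 19:21","SEA00362",17990.668,'
       '"6th Ave S & S King St","Westlake Ave & 6th Ave","ID-04","SLU-15"'
       + header.lstrip('"'))

fixed = repair_fused_line(bad, header)
print(fixed)
```

Applied to every line of the file, a helper like this would also catch any other rows with the same defect, at the cost of the no-embedded-commas assumption.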

#### Try reading files in again
(Skipping the original trip.csv)

In [13]:
# Clear data dictionary, re-list the folder (trip_fixed.csv was created after
# `files` was built), and load each file to a dictionary
data = {}
for f in listdir('cycle_share'):
    # we want to load trip_fixed, not trip
    if f != 'trip.csv':
        k = f.split('.')[0]  # remove .csv
        path = 'cycle_share/' + f
        try:
            # change key for dict
            if k == 'trip_fixed':
                k = 'trip'
            data[k] = pd.read_csv(path)
        except Exception as e:
            print('File : {}\n'.format(f))
            print(e)

## 2. Print data summaries including the number of null values. Should we drop or try to correct any of the null values?

In [14]:
# Loop through data dictionary
for k in data.keys():
    # print data name
    print(k.upper(), '\n')
    
    # show null counts
    print('Null counts')
    print(data[k].isnull().sum(), '\n')
    
    # summaries for columns of numeric type
    print(data[k].describe().T, '\n')
    
    # summaries for columns of 'object' type
    print(data[k].describe(include=['O']).T, '\n\n', 50*'-', '\n')

STATION 

Null counts
station_id            0
name                  0
lat                   0
long                  0
install_date          0
install_dockcount     0
modification_date    41
current_dockcount     0
decommission_date    54
dtype: int64 

                   count        mean       std         min         25%  \
lat                 58.0   47.624796  0.019066   47.598488   47.613239   
long                58.0 -122.327242  0.014957 -122.355230 -122.338735   
install_dockcount   58.0   17.586207  3.060985   12.000000   16.000000   
current_dockcount   58.0   16.517241  5.117021    0.000000   16.000000   

                          50%         75%         max  
lat                 47.618591   47.627712   47.666145  
long              -122.328207 -122.316691 -122.284119  
install_dockcount   18.000000   18.000000   30.000000  
current_dockcount   18.000000   18.000000   26.000000   

                  count unique                    top freq
station_id           58     58     

trip_id                   0
starttime                 0
stoptime                  0
bikeid                    0
tripduration              0
from_station_name         0
to_station_name           0
from_station_id           0
to_station_id             0
usertype                  1
gender               105301
birthyear            105305
dtype: int64 

                 count           mean           std       min          25%  \
trip_id       286858.0  112431.781746  76565.086482   431.000  43051.00000   
tripduration  286858.0    1178.354284   2038.697070    60.008    387.92575   
birthyear     181553.0    1979.759062     10.167119  1931.000   1974.00000   

                      50%           75%         max  
trip_id       103486.5000  179544.75000  255245.000  
tripduration     624.8465    1118.48325   28794.398  
birthyear       1983.0000    1987.00000    1999.000   

                    count  unique                              top    freq
starttime          286858  176216          

None of the null values in the data are critical for analysis. There is no sensible way to impute `modification_date` from the existing information, but `Mean_Temperature_F` can reasonably be imputed from the neighboring days, because mean temperature is unlikely to vary drastically from one day to the next. This reasoning breaks down for extended stretches of null values, where a shift in mean temperature is more likely. The cell below uses a forward fill, which copies the previous day's value into each gap; taking the mean of the two days bracketing the gap would be an equally reasonable choice. Nothing in the information we have gives a reason to drop rows with missing values.

In [15]:
# assign the filled column back instead of filling in place through attribute access
data['weather']['Mean_Temperature_F'] = data['weather']['Mean_Temperature_F'].ffill()
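
To make the difference between the two imputation choices concrete, here is a small sketch on made-up temperatures (the values are illustrative, not taken from the weather data). It compares a forward fill with linear interpolation, which for a single missing day equals the mean of the two bracketing days.

```python
import numpy as np
import pandas as pd

# Illustrative daily mean temperatures with one single gap and one two-day gap
temps = pd.Series([50.0, np.nan, 54.0, np.nan, np.nan, 60.0])

forward_filled = temps.ffill()      # copy the last known value forward
interpolated = temps.interpolate()  # straight line between known neighbors

print(pd.DataFrame({'raw': temps,
                    'ffill': forward_filled,
                    'interpolate': interpolated}))
```

For the single gap, interpolation gives 52.0 (the mean of 50 and 54) while the forward fill gives 50.0. Across the two-day gap both methods are guesses, which is the caveat about extended stretches of nulls.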

## 3. Create a column in the trip table that contains only the date (no time)

In [16]:
# what is present in the current date value?
data['trip'].starttime[0]


'10/13/2014 10:31'

In [17]:
# date and time are separated by a space, so split each accordingly using list comprehension
data['trip']['date'] = [t.split(' ')[0] for t in data['trip'].starttime]

In [18]:
# print the first 5 dates
data['trip'].date.head()

0    10/13/2014
1    10/13/2014
2    10/13/2014
3    10/13/2014
4    10/13/2014
Name: date, dtype: object

## 4. Merge weather data with trip data and be sure not to lose any trip data

In [19]:
# a left join will preserve all trip data and leave dates with no weather (if any) as null
trip_weather = data['trip'].merge(data['weather'], left_on = 'date', right_on = 'Date', how = 'left')

# now drop the unnecessary date/Date columns
trip_weather.drop(['date','Date'], axis = 1, inplace = True)

In [20]:
len(trip_weather) == len(data['trip'])

True
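
Comparing lengths confirms no trips were lost, but it does not show whether any trips failed to find a weather row. As a sketch on toy frames (the rows and values are made up), the `validate` and `indicator` arguments of `merge` can check both that the weather table has one row per date and which trips went unmatched.

```python
import pandas as pd

# Toy stand-ins for data['trip'] and data['weather']
trips = pd.DataFrame({'trip_id': [1, 2, 3],
                      'date': ['10/13/2014', '10/13/2014', '10/14/2014']})
weather = pd.DataFrame({'Date': ['10/13/2014'],
                        'Mean_Temperature_F': [56]})

merged = trips.merge(weather, left_on='date', right_on='Date', how='left',
                     validate='many_to_one',  # raises if a date repeats in weather
                     indicator=True)          # adds a '_merge' column

# Rows marked 'left_only' are trips with no matching weather record
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged) == len(trips), unmatched['trip_id'].tolist())
```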

## 5. Drop records that are completely duplicated (all values). Check for and inspect any duplicate trip_id values that remain. Remove if they exist.

In [21]:
# check starting shape of 'trip_weather' data
trip_weather.shape

(286858, 32)

In [22]:
# drop duplicates and reset index
trip_weather.drop_duplicates(inplace = True)
trip_weather.reset_index(drop = True, inplace = True)

# check size of resulting data
trip_weather.shape

(236066, 32)

In [23]:
# get number of duplicated trip IDs
dup_mask = trip_weather.trip_id.duplicated()
dup_mask.sum()

1

In [24]:
dup_trip_id = trip_weather.loc[dup_mask, 'trip_id']

# look at data for duplicates
trip_weather[trip_weather.trip_id.isin(dup_trip_id)].sort_values(by='trip_id').T

Unnamed: 0,50792,50793
trip_id,59000,59000
starttime,4/17/2015 14:21,4/17/2015 14:21
stoptime,4/17/2015 19:21,4/17/2015 19:21
bikeid,SEA00362,SEA00362
tripduration,17990.7,17990.7
from_station_name,6th Ave S & S King St,6th Ave S & S King St
to_station_name,Westlake Ave & 6th Ave,Westlake Ave & 6th Ave
from_station_id,ID-04,ID-04
to_station_id,SLU-15,SLU-15
usertype,,Short-Term Pass Holder


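**It looks like the only difference is that one record has a value for `usertype` and the other doesn't.** The record without it is trip 59000, the row repaired in section 1, where the last three fields were left null.

Drop the record without `usertype`.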

In [25]:
trip_weather.drop(50792, inplace = True)
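
Dropping by a hard-coded index works for one known row. As a more general sketch on a toy frame (the rows are made up around the real duplicate), sorting so that null `usertype` values come last and then keeping the first row per `trip_id` retains the more complete record without naming an index.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'trip_id': [58999, 59000, 59000, 59001],
    'usertype': ['Member', np.nan, 'Short-Term Pass Holder', 'Member'],
})

deduped = (toy.sort_values('usertype', na_position='last')  # nulls go last
              .drop_duplicates(subset='trip_id', keep='first')
              .sort_index())                                # restore row order
print(deduped)
```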

## 6. Create columns for lat & long values for the from- and to- stations

In [26]:
trip_weather = trip_weather.merge(data['station'][['station_id','lat','long']],
                                  left_on = 'from_station_id', right_on = 'station_id', how='left')\
                           .merge(data['station'][['station_id','lat','long']],
                                  left_on = 'to_station_id', right_on = 'station_id', how = 'left')

trip_weather.rename(columns = {'lat_x': 'from_lat', 'long_x': 'from_long', 'lat_y': 'to_lat','long_y': 'to_long'},
                    inplace = True)

trip_weather.drop(['station_id_x', 'station_id_y'], axis = 1, inplace = True)

In [27]:
trip_weather.head(2).T

Unnamed: 0,0,1
trip_id,431,432
starttime,10/13/2014 10:31,10/13/2014 10:32
stoptime,10/13/2014 10:48,10/13/2014 10:48
bikeid,SEA00298,SEA00195
tripduration,985.935,926.375
from_station_name,2nd Ave & Spring St,2nd Ave & Spring St
to_station_name,Occidental Park / Occidental Ave S & S Washing...,Occidental Park / Occidental Ave S & S Washing...
from_station_id,CBD-06,CBD-06
to_station_id,PS-04,PS-04
usertype,Member,Member


## 7. Write a function to round all `tripduration` values to the nearest half second increment and then round all the values in the data

In [28]:
def round_to_half(x):
    return round(x*2) / 2

In [29]:
trip_weather['tripduration'] = trip_weather.tripduration.apply(round_to_half)
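
A quick worked check of the rounding rule, using three durations that appear earlier in this notebook plus two tie cases. The function is repeated so the snippet runs on its own. Note that Python's built-in `round` sends exact ties to the nearest even integer, so a value exactly halfway between two half-second marks can round down.

```python
def round_to_half(x):
    # Double, round to the nearest integer, halve: steps of 0.5
    return round(x * 2) / 2

samples = [985.935, 926.375, 17990.668, 1.25, 1.75]
rounded = [round_to_half(x) for x in samples]
print(rounded)  # [986.0, 926.5, 17990.5, 1.0, 2.0]
```

Here 1.25 becomes 1.0 rather than 1.5 because `round(2.5)` is 2, while 1.75 becomes 2.0 because `round(3.5)` is 4.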

## 8. Verify that `tripduration` matches the timestamps to within 60 seconds

### Convert start and stop time columns to datetime objects

In [30]:
trip_weather.starttime = pd.to_datetime(trip_weather.starttime)
trip_weather.stoptime = pd.to_datetime(trip_weather.stoptime)

### Create a list of computed trip durations in seconds and use the math library to check that each is within 60 seconds of the recorded value

In [31]:
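# total_seconds() includes whole days; .seconds alone would drop them
trip_durations = [round_to_half((r.stoptime - r.starttime).total_seconds()) for r in trip_weather.itertuples()]
# abs_tol is an absolute tolerance in seconds (rel_tol=60 would accept nearly any pair)
all([math.isclose(a, b, abs_tol=60) for a, b in zip(trip_durations, trip_weather.tripduration)])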

True
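
Two details decide whether this check is meaningful: which tolerance argument `math.isclose` receives, and which accessor reads the time difference. The sketch below uses made-up numbers to show how each behaves.

```python
import math
from datetime import timedelta

# rel_tol is a fraction of the larger value, so 60 means "within 6000%"
loose = math.isclose(100, 5000, rel_tol=60)
# abs_tol is an absolute difference in the values' own units (seconds here)
strict_far = math.isclose(100, 5000, abs_tol=60)
strict_near = math.isclose(100, 150, abs_tol=60)
print(loose, strict_far, strict_near)  # True False True

# .seconds is only the seconds component; total_seconds() includes days
delta = timedelta(days=1, seconds=30)
print(delta.seconds, delta.total_seconds())  # 30 86430.0
```

pandas `Timedelta` objects expose the same `.seconds` and `.total_seconds()` pair.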

## 9. Something is wrong with the `Max_Gust_Speed_MPH` column. Identify and correct the problem, then save the data.

The data summaries printed in section 2 show that `Max_Gust_Speed_MPH` appears in the object-type summary, not the numeric one, and that its top value is '`-`'. The dashes are the cause of the problem. Change them all to NaN using `np.nan`, then change the column type to float.

In [32]:
# replace dash with nan (the values are strings here, so test for the dash itself;
# an isinstance(x, float) test would turn every string value into nan)
trip_weather.Max_Gust_Speed_MPH = trip_weather.Max_Gust_Speed_MPH.replace('-', np.nan)
# change column dtype to float
trip_weather.Max_Gust_Speed_MPH = trip_weather.Max_Gust_Speed_MPH.astype(float)

In [33]:
trip_weather.Max_Gust_Speed_MPH.dtype

dtype('float64')
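
An alternative is to declare the dash as a missing-value marker when the file is read, so the column arrives as float in the first place. This is a sketch on an in-memory CSV with made-up rows, not a change to how the notebook loads its data.

```python
import io
import pandas as pd

# Made-up weather rows; '-' marks a missing gust speed
raw = io.StringIO(
    "Date,Max_Gust_Speed_MPH\n"
    "10/13/2014,25\n"
    "10/14/2014,-\n"
    "10/15/2014,30\n"
)

weather = pd.read_csv(raw, na_values=['-'])  # treat '-' as NaN while parsing
print(weather.dtypes)
print(weather)
```

`na_values` applies to every column, so this is only safe when a bare dash never carries meaning elsewhere in the file.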

### Save the trip_weather data as trip_cleaned.csv

In [34]:
trip_weather.to_csv('cycle_share/trip_cleaned.csv', index = False)

# Cleaning: Movies

This data set contains 28 attributes related to various movie titles that have been scraped from IMDb. The set is supposed to contain unique titles for each record, where each record has the following attributes:

"movie_title" "color" "num_critic_for_reviews" "movie_facebook_likes" "duration" "director_name" "director_facebook_likes" "actor_3_name" "actor_3_facebook_likes" "actor_2_name" "actor_2_facebook_likes" "actor_1_name" "actor_1_facebook_likes" "gross" "genres" "num_voted_users" "cast_total_facebook_likes" "facenumber_in_poster" "plot_keywords" "movie_imdb_link" "num_user_for_reviews" "language" "country" "content_rating" "budget" "title_year" "imdb_score" "aspect_ratio"

The original set is available on Kaggle ([here](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset))

In [35]:
# Import movies data
movies = pd.read_csv('movies/movies_data.csv')

In [36]:
# Check data
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color                        5024 non-null object
director_name                4939 non-null object
num_critic_for_reviews       4993 non-null float64
duration                     5028 non-null float64
director_facebook_likes      4939 non-null float64
actor_3_facebook_likes       5020 non-null float64
actor_2_name                 5030 non-null object
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
actor_1_name                 5036 non-null object
movie_title                  5043 non-null object
num_voted_users              5043 non-null int64
cast_total_facebook_likes    5043 non-null int64
actor_3_name                 5020 non-null object
facenumber_in_poster         5030 non-null float64
plot_keywords                4890 non-null object
movie_imdb_link              5043 non-

## 1. Check for and correct similar values in `color`, `language`,  and `country`

In [37]:
for col in ['color','language','country']:
    print(col, '\n')
    print(movies[col].value_counts().sort_index(), '\n\n')

color 

Black and White     206
Color              4799
black and white       3
color                16
Name: color, dtype: int64 


language 

Aboriginal       2
Arabic           5
Aramaic          1
Bosnian          1
Cantonese       11
Chinese          3
Czech            1
Danish           5
Dari             2
Dutch            4
Dzongkha         1
English       4704
Filipino         1
French          73
German          19
Greek            1
Hebrew           5
Hindi           28
Hungarian        1
Icelandic        2
Indonesian       2
Italian         11
Japanese        18
Kannada          1
Kazakh           1
Korean           8
Mandarin        26
Maya             1
Mongolian        1
None             2
Norwegian        4
Panjabi          1
Persian          4
Polish           4
Portuguese       8
Romanian         2
Russian         11
Slovenian        1
Spanish         40
Swahili          1
Swedish          5
Tamil            1
Telugu           1
Thai             3
Urdu             1
V

Only the `color` column needs to be corrected: both of its categories also appear in an all-lowercase variant.

In [38]:
movies.loc[(movies.color == 'color'), 'color'] = 'Color'
movies.loc[(movies.color == 'black and white'), 'color'] = 'Black and White'
movies.color.value_counts()

Color              4815
Black and White     209
Name: color, dtype: int64

## 2. Create a function that detects and lists non-numeric columns containing values with leading or trailing whitespace. Remove the whitespace in these columns.

### Check for whitespace in columns

In [39]:
# check for leading or trailing whitespace
def has_whitespace(data, cols):
    whitespace = []
    for col in cols:
        for x in data[col]:
            # null values (NaN floats) have no split method, so skip them
            try:
                l = x.split(' ')
            except AttributeError:
                continue
            if l[0] == '' or l[-1] == '':
                # has leading or trailing whitespace
                print('{} has whitespace'.format(col))
                whitespace.append(col)
                break
    return whitespace

In [40]:
# get list of all non-numeric columns
str_cols = movies.select_dtypes(include=['O']).columns

# get those with whitespace
whitespace = has_whitespace(movies, str_cols)
whitespace

director_name has whitespace
actor_2_name has whitespace
movie_title has whitespace


['director_name', 'actor_2_name', 'movie_title']

Strip this whitespace using the pandas `apply` method and `Series.str.strip()`:

In [41]:
movies[whitespace] = movies[whitespace].apply(lambda x: x.str.strip())

In [42]:
# check for whitespace again
has_whitespace(movies, str_cols)

[]
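
The same detection can be written without the inner Python loop by comparing each column with its stripped version. This is a sketch with a hypothetical function name and a toy frame. Unlike the loop above, it also flags tabs and other whitespace, not only spaces.

```python
import pandas as pd

def cols_with_edge_whitespace(df):
    """Return non-numeric columns holding values with leading/trailing whitespace."""
    found = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # only text columns are of interest
        s = df[col].dropna().astype(str)  # nulls cannot carry whitespace
        if (s != s.str.strip()).any():
            found.append(col)
    return found

toy = pd.DataFrame({
    'director_name': ['James Cameron ', 'Gore Verbinski'],
    'movie_title': [' Avatar', 'Spectre'],
    'genres': ['Action', None],
    'duration': [178.0, 148.0],
})
print(cols_with_edge_whitespace(toy))  # ['director_name', 'movie_title']
```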

## 3. Remove duplicate records. Inspect any remaining duplicate movie titles.

### Remove duplicate rows

In [43]:
# Drop duplicates and check shape
movies.drop_duplicates(inplace = True)
movies.shape

(4998, 28)

### Inspect any remaining duplicate movie titles

In [44]:
# compare number of unique movie titles with length of the data
movies.movie_title.nunique(), len(movies)

(4916, 4998)

### Explore the duplicate titles and drop duplicate records

In [45]:
# examine stats for counts per title
movies.movie_title.value_counts().describe()

count    4916.000000
mean        1.016680
std         0.132763
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         3.000000
Name: movie_title, dtype: float64

In [46]:
# list of titles with counts
movie_counts = movies.movie_title.value_counts()

# select titles with count > 1 (duplicates)
potential_dups = movie_counts[movie_counts > 1].index
potential_dups

Index(['Home', 'King Kong', 'Ben-Hur', 'Day of the Dead', 'The Host',
       'Goosebumps', 'The Return of the Living Dead', 'Lucky Number Slevin',
       'The Karate Kid', 'Teenage Mutant Ninja Turtles', 'Brothers', 'Heist',
       'First Blood', 'The Jungle Book', 'Chasing Liberty', 'Jack Reacher',
       'Ghostbusters', 'The Great Gatsby', 'Dekalog', 'The Unborn', 'Carrie',
       'Eddie the Eagle', 'Casino Royale', 'Creepshow', 'Disturbia',
       'Cinderella', 'Conan the Barbarian', 'The Lovers',
       'Oz the Great and Powerful', 'The Island', 'Halloween', 'The Gift',
       'Spider-Man 3', 'The Lovely Bones', 'The Dead Zone', 'Sabotage',
       'Alice in Wonderland', 'Planet of the Apes', 'The Gambler',
       'The Texas Chain Saw Massacre', 'Murder by Numbers', 'Skyfall',
       'Lolita', 'The Tourist', 'Exodus: Gods and Kings', 'Precious',
       'Point Break', 'Clash of the Titans', 'Glory', 'Syriana', 'RoboCop',
       'Unknown', 'Victor Frankenstein', 'A Nightmare on Elm St

## 4. Create a function that returns two arrays: one for titles that are truly duplicated, and one for duplicated titles that are not the same movie.
* hint: do this by comparing the imdb link values

In [47]:
def movie_duplicates(movies, potential):
    # subset tells pandas what columns to consider when determining duplicates
    subset=['movie_title','movie_imdb_link']
    
    # a boolean mask for indexing duplicated titles
    dup_mask = movies.duplicated(subset = subset)
    
    # get duplicated titles
    duplicated = movies.loc[dup_mask].movie_title.unique()
    
    # get the potential titles that are not in duplicated
    not_duplicated = Series(potential)[~Series(potential).isin(duplicated)].values
    
    return duplicated, not_duplicated

In [48]:
dup, not_dup = movie_duplicates(movies, potential_dups)
not_dup

array(['The Host', 'The Dead Zone', 'Out of the Blue'], dtype=object)
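
Another way to draw the same distinction is to count the distinct IMDb links per title. The sketch below uses a toy frame with placeholder link values (`link_a` and so on are not real URLs). A repeated title with more than one distinct link refers to different movies, and a repeated title with a single link is a true duplicate.

```python
import pandas as pd

toy = pd.DataFrame({
    'movie_title': ['The Host', 'The Host', 'King Kong', 'King Kong', 'Skyfall'],
    'movie_imdb_link': ['link_a', 'link_b', 'link_c', 'link_c', 'link_d'],
})

# Number of distinct links behind each title
links_per_title = toy.groupby('movie_title')['movie_imdb_link'].nunique()

# Titles that occur more than once
counts = toy['movie_title'].value_counts()
repeated = counts[counts > 1].index

different_movies = sorted(t for t in repeated if links_per_title[t] > 1)
true_duplicates = sorted(t for t in repeated if links_per_title[t] == 1)
print(different_movies, true_duplicates)  # ['The Host'] ['King Kong']
```

The two approaches can disagree on a title that has both a repeated link and a distinct one: `movie_duplicates` would report it as duplicated, while this count would report it as different movies.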

## 5. Alter the names of duplicate titles that are different movies so each is unique. Then drop all duplicate rows based on movie title.

In [49]:
# iterate through titles, renaming non-duplicate titles
for m in not_dup:
    # enumerate indices of titles
    for n, idx in enumerate(movies[movies.movie_title == m].index):
        # append '_n' to end of titles
        movies.loc[idx, 'movie_title'] = m + '_{}'.format(n)

# drop duplicate movies
movies.drop_duplicates(subset = ['movie_title'], inplace = True)

Ensure there are no longer duplicate titles

In [50]:
# list of titles with counts
movies.movie_title.value_counts().head()

Alexander's Ragtime Band             1
Beverly Hills Chihuahua              1
Supernova                            1
Iron Man 3                           1
A Thin Line Between Love and Hate    1
Name: movie_title, dtype: int64

## 6. Create a series that ranks actors by proportion of movies they have appeared in

In [51]:
# get value counts for actors in each of the actor columns
a1 = DF(movies.actor_1_name.value_counts())
a2 = DF(movies.actor_2_name.value_counts())
a3 = DF(movies.actor_3_name.value_counts())

In [52]:
# merge all of these using an outer join (not all actors in all 3 columns, want to keep all)
a_all = a1.merge(a2, how='outer', left_index = True, right_index = True)\
    .merge(a3, how='outer', left_index = True, right_index = True)

In [53]:
# sum across axis 1 to get the total for each actor
actor_counts = a_all.sum(axis = 1)

# create the ranks by dividing each actor total by total number of movies
actor_ranks = (actor_counts/len(movies)).sort_values(ascending = False)
actor_ranks.head()

Robert De Niro    0.010775
Morgan Freeman    0.008742
Bruce Willis      0.007725
Matt Damon        0.007522
Johnny Depp       0.007319
dtype: float64
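
The three value counts and two outer joins can also be replaced by stacking the actor columns into one Series and counting once. This is a sketch on a toy frame with placeholder actor names, assuming an actor is never listed twice in the same movie.

```python
import pandas as pd

toy = pd.DataFrame({
    'actor_1_name': ['Actor A', 'Actor B', 'Actor A'],
    'actor_2_name': ['Actor B', 'Actor C', None],
    'actor_3_name': ['Actor C', 'Actor A', 'Actor B'],
})

cols = ['actor_1_name', 'actor_2_name', 'actor_3_name']
# Stack the three columns end to end; value_counts ignores missing names
stacked = pd.concat([toy[c] for c in cols], ignore_index=True)
toy_counts = stacked.value_counts()

# Proportion of movies each actor appears in
toy_ranks = (toy_counts / len(toy)).sort_values(ascending=False)
print(toy_ranks)
```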

## 7. Create a table that contains the first and last years each actor appeared, and their length of history. Then include columns for each actor's proportion and total number of movies.
* length is number of years they have appeared in movies

In [54]:
# create group objects for each actor column
g1 = movies.groupby('actor_1_name')
g2 = movies.groupby('actor_2_name')
g3 = movies.groupby('actor_3_name')


# use apply method to get dataframes with first and last years for each actor
hists = {}
for i,g in enumerate([g1, g2, g3]):
    k = 'g{}'.format(i)
    hists[k] = g.apply(lambda x: Series({'last': x['title_year'].max(),
                                         'first': x['title_year'].min()}))

# preview results
hists['g0'].head()

Unnamed: 0_level_0,first,last
actor_1_name,Unnamed: 1_level_1,Unnamed: 2_level_1
50 Cent,2005.0,2005.0
A.J. Buckley,2015.0,2015.0
Aaliyah,2002.0,2002.0
Aasif Mandvi,2008.0,2008.0
Abbie Cornish,2009.0,2012.0


**Merge all history tables**

In [55]:
history = hists['g0'].merge(hists['g1'], how='outer', left_index = True, right_index = True) \
    .merge(hists['g2'], how='outer', left_index = True, right_index = True)

# compute years from max - min of each row
actor_hist = history.apply(lambda r: Series({'first': r.min(),
                                       'last':r.max(),
                                       'years': r.max() - r.min()}),
                    axis=1).sort_values(by='years', ascending = False)

# preview results
actor_hist.head()

Unnamed: 0,first,last,years
Laurence Olivier,1940.0,2004.0,64.0
Debbie Reynolds,1952.0,2012.0,60.0
Marlon Brando,1951.0,2006.0,55.0
Dean Stockwell,1947.0,2001.0,54.0
Robert Duvall,1962.0,2014.0,52.0


### Add columns for number and proportion of movies

In [56]:
actor_hist['movie_prop'] = actor_ranks
actor_hist['movie_count'] = round(actor_ranks*len(movies))
actor_hist.head()

Unnamed: 0,first,last,years,movie_prop,movie_count
Laurence Olivier,1940.0,2004.0,64.0,0.001016,5.0
Debbie Reynolds,1952.0,2012.0,60.0,0.000813,4.0
Marlon Brando,1951.0,2006.0,55.0,0.00183,9.0
Dean Stockwell,1947.0,2001.0,54.0,0.001016,5.0
Robert Duvall,1962.0,2014.0,52.0,0.004879,24.0


## 8. Create a column that gives each movie an integer ranking based on gross sales
* 1 should indicate the highest gross
* If more than one movie has equal sales, assign all the lowest rank in the group
* The next rank after this group should increase only by 1

Use the pandas `rank` method with `ascending=False` (so the highest sales get rank 1) and `method='dense'`.

The dense method gives every title in a group with equal sales the same rank, and the next distinct value gets a rank exactly one higher, rather than the group's rank plus the number of titles in the group.

In [57]:
movies['movie_sales_rank'] = movies.gross.rank(method = 'dense', ascending = False)
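
The top ten rows of the real data contain no ties, so the tie-handling method makes no visible difference there. A toy Series with made-up gross values shows how `'min'` and `'dense'` diverge after a tie.

```python
import pandas as pd

# Made-up gross values with one tie
gross = pd.Series([300.0, 200.0, 200.0, 100.0])

min_rank = gross.rank(method='min', ascending=False)
dense_rank = gross.rank(method='dense', ascending=False)

print(min_rank.tolist())    # [1.0, 2.0, 2.0, 4.0]
print(dense_rank.tolist())  # [1.0, 2.0, 2.0, 3.0]
```

Both methods give the tied pair rank 2, but only `'dense'` continues with 3, which is what the third requirement asks for.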

In [58]:
# notice the inverse relation between rank and gross
movies[['gross', 'movie_sales_rank']].sort_values(by = 'gross', ascending = False).head(10)

Unnamed: 0,gross,movie_sales_rank
0,760505847.0,1.0
26,658672302.0,2.0
29,652177271.0,3.0
17,623279547.0,4.0
66,533316061.0,5.0
240,474544677.0,6.0
3024,460935665.0,7.0
8,458991599.0,8.0
3,448130642.0,9.0
582,436471036.0,10.0
