# Numeric data or ... ?

In this exercise, and throughout this chapter, you'll be working with a DataFrame called ride_sharing that holds bicycle ride sharing data from San Francisco. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

The user_type column contains information on whether a user is taking a free ride and takes on the following values:

* 1 for free riders.
* 2 for pay per ride.
* 3 for monthly subscribers.

In this exercise, you will print the information of ride_sharing using .info() and see firsthand how an incorrect data type can distort your analysis of the dataset. The pandas package is imported as pd.


* Print the information of ride_sharing.
* Use .describe() to print the summary statistics of the user_type column from ride_sharing.
* Convert user_type into categorical by assigning it the 'category' data type and store it in the user_type_cat column.
* Make sure you converted user_type_cat correctly by using an assert statement.
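
The steps above can be tried on a small example first. The sketch below is illustrative only: the `sample` DataFrame and its values are made up, and it assumes user_type is stored as the integer codes 1, 2 and 3 described earlier.

```python
import pandas as pd

# Hypothetical sample (not the real data): user_type stored as integer codes
sample = pd.DataFrame({"user_type": [1, 2, 3, 3, 2, 3]})

# On an integer column, describe() reports mean, std and quartiles,
# which say little about what are really category labels
numeric_summary = sample["user_type"].describe()
print(numeric_summary)

# Convert to the 'category' dtype and confirm the change
sample["user_type_cat"] = sample["user_type"].astype("category")
assert sample["user_type_cat"].dtype == "category"

# On a categorical column, describe() reports count, unique, top and freq
category_summary = sample["user_type_cat"].describe()
print(category_summary)
```

Comparing the two summaries shows why the data type matters: the mean of the codes has no useful interpretation, while the most frequent category does.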


In [3]:
import pandas as pd

ride_sharing = pd.read_csv("/kaggle/input/chicago-divvy-bicycle-sharing-data/data.csv")

ride_sharing.head()

Unnamed: 0,trip_id,year,month,week,day,hour,usertype,gender,starttime,stoptime,...,from_station_id,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_id,to_station_name,latitude_end,longitude_end,dpcapacity_end
0,2355134,2014,6,27,0,23,Subscriber,Male,2014-06-30 23:57:00,2014-07-01 00:07:00,...,131,Lincoln Ave & Belmont Ave,41.939365,-87.668385,15.0,303,Broadway & Cornelia Ave,41.945512,-87.64598,15.0
1,2355133,2014,6,27,0,23,Subscriber,Male,2014-06-30 23:56:00,2014-07-01 00:00:00,...,282,Halsted St & Maxwell St,41.86458,-87.64693,15.0,22,May St & Taylor St,41.869482,-87.655486,15.0
2,2355130,2014,6,27,0,23,Subscriber,Male,2014-06-30 23:33:00,2014-06-30 23:35:00,...,327,Sheffield Ave & Webster Ave,41.921687,-87.653714,19.0,225,Halsted St & Dickens Ave,41.919936,-87.64883,15.0
3,2355129,2014,6,27,0,23,Subscriber,Female,2014-06-30 23:26:00,2014-07-01 00:24:00,...,134,Peoria St & Jackson Blvd,41.877749,-87.649633,19.0,194,State St & Wacker Dr,41.887155,-87.62775,11.0
4,2355128,2014,6,27,0,23,Subscriber,Female,2014-06-30 23:16:00,2014-06-30 23:26:00,...,320,Loomis St & Lexington St,41.872187,-87.661501,15.0,134,Peoria St & Jackson Blvd,41.877749,-87.649633,19.0


In [4]:
ride_sharing.columns

Index(['trip_id', 'year', 'month', 'week', 'day', 'hour', 'usertype', 'gender',
       'starttime', 'stoptime', 'tripduration', 'temperature', 'events',
       'from_station_id', 'from_station_name', 'latitude_start',
       'longitude_start', 'dpcapacity_start', 'to_station_id',
       'to_station_name', 'latitude_end', 'longitude_end', 'dpcapacity_end'],
      dtype='object')

In [9]:
ride_sharing['usertype']

0          Subscriber
1          Subscriber
2          Subscriber
3          Subscriber
4          Subscriber
              ...    
9495230    Subscriber
9495231    Subscriber
9495232    Subscriber
9495233    Subscriber
9495234    Subscriber
Name: usertype, Length: 9495235, dtype: object

In [6]:
# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['usertype'].describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9495235 entries, 0 to 9495234
Data columns (total 23 columns):
 #   Column             Dtype  
---  ------             -----  
 0   trip_id            int64  
 1   year               int64  
 2   month              int64  
 3   week               int64  
 4   day                int64  
 5   hour               int64  
 6   usertype           object 
 7   gender             object 
 8   starttime          object 
 9   stoptime           object 
 10  tripduration       float64
 11  temperature        float64
 12  events             object 
 13  from_station_id    int64  
 14  from_station_name  object 
 15  latitude_start     float64
 16  longitude_start    float64
 17  dpcapacity_start   float64
 18  to_station_id      int64  
 19  to_station_name    object 
 20  latitude_end       float64
 21  longitude_end      float64
 22  dpcapacity_end     float64
dtypes: float64(8), int64(8), object(7)
memory usage: 1.6+ GB
None
count        9495235

In [7]:
# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['usertype'].describe())

# Convert usertype (stored as object, i.e. strings, in this dataset) to category
ride_sharing['user_type_cat'] = ride_sharing['usertype'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9495235 entries, 0 to 9495234
Data columns (total 23 columns):
 #   Column             Dtype  
---  ------             -----  
 0   trip_id            int64  
 1   year               int64  
 2   month              int64  
 3   week               int64  
 4   day                int64  
 5   hour               int64  
 6   usertype           object 
 7   gender             object 
 8   starttime          object 
 9   stoptime           object 
 10  tripduration       float64
 11  temperature        float64
 12  events             object 
 13  from_station_id    int64  
 14  from_station_name  object 
 15  latitude_start     float64
 16  longitude_start    float64
 17  dpcapacity_start   float64
 18  to_station_id      int64  
 19  to_station_name    object 
 20  latitude_end       float64
 21  longitude_end      float64
 22  dpcapacity_end     float64
dtypes: float64(8), int64(8), object(7)
memory usage: 1.6+ GB
None
count        9495235

### Summing strings and concatenating numbers

In the previous exercise, you were able to identify that category is the correct data type for user_type and convert it in order to extract relevant statistical summaries that shed light on the distribution of user_type.

Another common data type problem is importing what should be numerical values as strings. Mathematical operations such as summing or multiplying then concatenate or repeat the strings instead of producing numerical outputs.
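
A minimal sketch of this problem, using a made-up Series rather than the exercise data. The object dtype is set explicitly here so that the values are treated as plain Python strings.

```python
import pandas as pd

# Hypothetical sample: numbers that were imported as strings
as_strings = pd.Series(["12", "8", "11"], dtype="object")

# Summing an object (string) column concatenates the values
string_total = as_strings.sum()
print(string_total)  # 12811

# Multiplying by an integer repeats each string
doubled = as_strings * 2
print(doubled.tolist())  # ['1212', '88', '1111']

# After converting to integers, the sum is arithmetic
numeric_total = as_strings.astype("int").sum()
print(numeric_total)  # 31
```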

In this exercise, you'll be converting the string column duration to the type int. Before that, however, you will need to strip "minutes" from the column so that pandas can read it as numerical. The pandas package has been imported as pd.



* Use the .strip() method to strip duration of "minutes" and store it in the duration_trim column.
* Convert duration_trim to int and store it in the duration_time column.
* Write an assert statement that checks if duration_time's data type is now an int.
* Print the average ride duration.
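
As a rough illustration of these steps, the sketch below uses a made-up `rides` DataFrame whose duration column is text such as "12 minutes". It checks the result with `pd.api.types.is_integer_dtype`, a swapped-in alternative to comparing the dtype with 'int' that does not depend on the platform's default integer width.

```python
import pandas as pd

# Hypothetical sample mimicking a duration column stored as text
rides = pd.DataFrame({"duration": ["12 minutes", "8 minutes", "11 minutes"]})

# .str.strip("minutes") removes any of the characters m, i, n, u, t, e, s
# from both ends; a second .str.strip() clears the leftover space
rides["duration_trim"] = rides["duration"].str.strip("minutes").str.strip()

# Convert the trimmed strings to integers
rides["duration_time"] = rides["duration_trim"].astype("int")

# Confirm the conversion produced an integer column
assert pd.api.types.is_integer_dtype(rides["duration_time"])

# The mean is now a numerical average
print(rides["duration_time"].mean())
```

Note that `.str.strip()` treats its argument as a set of characters rather than a literal word, which is harmless here because the digits are not in that set.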


In [8]:
ride_sharing['tripduration']

0          10.066667
1           4.383333
2           2.100000
3          58.016667
4          10.633333
             ...    
9495230    11.066667
9495231    11.033333
9495232    13.950000
9495233     6.016667
9495234    12.350000
Name: tripduration, Length: 9495235, dtype: float64

In [11]:
# # Strip duration of minutes
# ride_sharing['duration_trim'] = ride_sharing['tripduration'].str.strip('minutes')

# # Convert duration to integer
# ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype(int)
# # Write an assert statement making sure of conversion
# assert ride_sharing['duration_time'].dtype == 'int'

# # Print formed columns and calculate average ride duration 
# print(ride_sharing[['duration','duration_trim','duration_time']])
# print(ride_sharing['duration_time'].mean())

In [15]:
# Remove the word 'minutes' from tripduration
ride_sharing['duration_trim'] = ride_sharing['tripduration'].astype(str).str.replace(' minutes', '', regex=False)

# Convert duration_trim to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype(float).astype(int)

# Assert the conversion to integer
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration
print(ride_sharing[['tripduration', 'duration_trim', 'duration_time']])
print("Mean Duration Ride Time :",ride_sharing['duration_time'].mean())


         tripduration       duration_trim  duration_time
0           10.066667  10.066666666666666             10
1            4.383333   4.383333333333334              4
2            2.100000                 2.1              2
3           58.016667  58.016666666666666             58
4           10.633333  10.633333333333333             10
...               ...                 ...            ...
9495230     11.066667  11.066666666666666             11
9495231     11.033333  11.033333333333331             11
9495232     13.950000               13.95             13
9495233      6.016667   6.016666666666667              6
9495234     12.350000               12.35             12

[9495235 rows x 3 columns]
Mean Duration Ride Time : 10.956872789351712


Great work! An average of about 11 minutes is really not bad for a ride duration in a city like San Francisco. In the next lesson, you're going to jump right ahead into sanity checking the range of values in your data.

# Tire size constraints

In this lesson, you're going to build on top of the work you've been doing with the ride_sharing DataFrame. You'll be working with the tire_sizes column which contains data on each bike's tire size.

Bicycle tire sizes could be either 26″, 27″ or 29″ and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.

In this exercise, you will make sure the tire_sizes column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes.


* Convert the tire_sizes column from category to 'int'.
* Use .loc[] to set all values of tire_sizes above 27 to 27.
* Convert tire_sizes back to 'category' from int.
* Print the description of the tire_sizes.
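
The steps above can be sketched on a small example. The `bikes` DataFrame and its values are hypothetical, and the sketch assumes tire_sizes is stored as an unordered categorical, as described above.

```python
import pandas as pd

# Hypothetical sample: tire sizes stored as a categorical column
bikes = pd.DataFrame({"tire_sizes": pd.Categorical([26, 27, 29, 27, 29])})

# Unordered categories do not support comparisons such as > 27,
# so convert to integer first
bikes["tire_sizes"] = bikes["tire_sizes"].astype("int")

# Cap every value above 27 at 27
bikes.loc[bikes["tire_sizes"] > 27, "tire_sizes"] = 27

# Check the new maximum while the column is still numeric
max_size = bikes["tire_sizes"].max()
print(max_size)  # 27

# Convert back to categorical and print the description
bikes["tire_sizes"] = bikes["tire_sizes"].astype("category")
print(bikes["tire_sizes"].describe())
```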





In [19]:
# Convert month (standing in for tire_sizes in this dataset) to integer
ride_sharing['month'] = ride_sharing['month'].astype('int')

# Set all values above 7 to 7 (the cap standing in for 27″)
ride_sharing.loc[ride_sharing["month"] > 7, "month"] = 7

# Reconvert month back to categorical
ride_sharing['month'] = ride_sharing['month'].astype('category')

# Print month description
print(ride_sharing['month'].describe())

count     9495235
unique          6
top             6
freq      6988344
Name: month, dtype: int64


Awesome work! Keep in mind that the top row in the description is the most frequent value, not the maximum, so check the new maximum with .max() while the column is still an integer. Notice how essential it was to convert tire_sizes into integer before setting a new maximum.

### Back to the future
The data pipeline feeding into the ride_sharing DataFrame has been updated to register each ride's date. This information is stored in the ride_date column of the type object, which represents strings in pandas.

A bug was discovered that was recording rides taken today as taken next year. To fix this, you will find all values of the ride_date column that fall in the future and set them to today's date, so that today is the maximum possible value of the column. Before doing so, you will need to convert ride_date to a datetime object.

The datetime package has been imported as dt, alongside all the packages you've been using till now.



* Convert ride_date to a datetime object using to_datetime(), then convert the datetime object into a date and store it in ride_dt column.
* Create the variable today, which stores today's date by using the dt.date.today() function.
* For all instances of ride_dt in the future, set them to today's date.
* Print the maximum date in the ride_dt column.
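
A minimal sketch of these steps on made-up data. The `rides` DataFrame and the `next_year` helper are hypothetical; they simulate one ride with a valid past date and one wrongly dated a year ahead.

```python
import datetime as dt
import pandas as pd

# Save today's date and build a date one year ahead to simulate the bug
today = dt.date.today()
next_year = today + dt.timedelta(days=365)

# Hypothetical sample: ride_date stored as strings (object dtype)
rides = pd.DataFrame({"ride_date": ["2017-06-06", next_year.isoformat()]})

# Parse the strings, then keep only the date part
rides["ride_dt"] = pd.to_datetime(rides["ride_date"]).dt.date

# Set every date in the future to today's date
rides.loc[rides["ride_dt"] > today, "ride_dt"] = today

# The maximum can no longer exceed today
print(rides["ride_dt"].max())
```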

In [20]:
ride_sharing.columns

Index(['trip_id', 'year', 'month', 'week', 'day', 'hour', 'usertype', 'gender',
       'starttime', 'stoptime', 'tripduration', 'temperature', 'events',
       'from_station_id', 'from_station_name', 'latitude_start',
       'longitude_start', 'dpcapacity_start', 'to_station_id',
       'to_station_name', 'latitude_end', 'longitude_end', 'dpcapacity_end',
       'user_type_cat', 'duration_time', 'duration_trim'],
      dtype='object')

In [23]:
import pandas as pd

# Replace invalid 'day' values (e.g., 0) with 1
ride_sharing['day'] = ride_sharing['day'].replace(0, 1)

# Create a new column 'ride_date' by combining 'year', 'month', and 'day'
ride_sharing['ride_date'] = pd.to_datetime(
    ride_sharing[['year', 'month', 'day']]
)

# Print the last few rows to verify
print(ride_sharing[['year', 'month', 'day', 'ride_date']].tail())


         year month  day  ride_date
9495230  2017     6    6 2017-06-06
9495231  2017     6    6 2017-06-06
9495232  2017     6    6 2017-06-06
9495233  2017     6    6 2017-06-06
9495234  2017     6    6 2017-06-06


In [25]:
import datetime as dt

# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

2017-06-06


Great job! Imagine counting the number of rides taken today without having cleaned your ranges correctly. You would have wildly underreported your findings!

# Finding duplicates

A new update to the data pipeline feeding into ride_sharing has added the ride_id column, which represents a unique identifier for each ride.

The update, however, coincided with radically shorter average ride durations and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the ride_sharing DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates. A sample of ride_sharing is in your environment, as well as all the packages you've been working with thus far.



* Find duplicated rows of ride_id in the ride_sharing DataFrame while setting keep to False.
* Subset ride_sharing on duplicates and sort by ride_id and assign the results to duplicated_rides.
* Print the ride_id, duration and user_birth_year columns of duplicated_rides in that order.
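
The steps above can be tried on a small example. The `rides` DataFrame below is made up: ride 33 appears as a complete duplicate, and ride 89 as an incomplete duplicate with differing durations.

```python
import pandas as pd

# Hypothetical sample with one complete and one incomplete duplicate
rides = pd.DataFrame({
    "ride_id": [33, 55, 33, 71, 89, 89],
    "duration": [10, 9, 10, 11, 6, 8],
    "user_birth_year": [1979, 1985, 1979, 1997, 1986, 1986],
})

# keep=False flags every row whose ride_id occurs more than once,
# not just the second and later occurrences
duplicates = rides.duplicated(subset="ride_id", keep=False)

# Subset on the flagged rows and sort so duplicates sit next to each other
duplicated_rides = rides[duplicates].sort_values(by="ride_id")

print(duplicated_rides[["ride_id", "duration", "user_birth_year"]])
```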


In [26]:
# Find duplicates
duplicates = ride_sharing.duplicated("trip_id", keep = False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values(by = 'trip_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['trip_id','duration_time','user_type_cat']])

          trip_id  duration_time user_type_cat
5679603  10958572              8    Subscriber
5679602  10958572              8    Subscriber
5653373  10999878              8    Subscriber
5653372  10999878              8    Subscriber
5649026  11006910              7    Subscriber
...           ...            ...           ...
5250864  11693784              6    Subscriber
5246907  11707860             19    Subscriber
5246906  11707860             19    Subscriber
5245997  11710838             19    Subscriber
5245996  11710838             19    Subscriber

[94 rows x 3 columns]


Notice that some trips are duplicated: 94 rows share their trip_id with at least one other row.


### Treating duplicates
In the last exercise, you were able to verify that the new update feeding into ride_sharing contains a bug generating both complete and incomplete duplicated rows for some values of the ride_id column, with occasional discrepant values for the user_birth_year and duration columns.

In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration, and the minimum user_birth_year for each set of incomplete duplicate rows.


* Drop complete duplicates in ride_sharing and store the results in ride_dup.
* Create the statistics dictionary which holds minimum aggregation for user_birth_year and mean aggregation for duration.
* Drop incomplete duplicates by grouping by ride_id and applying the aggregation in statistics.
* Find duplicates again and run the assert statement to verify de-duplication.
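
A minimal sketch of this treatment, assuming a made-up `rides` DataFrame in which ride 33 is a complete duplicate and ride 89 an incomplete one.

```python
import pandas as pd

# Hypothetical sample with one complete and one incomplete duplicate
rides = pd.DataFrame({
    "ride_id": [33, 55, 33, 71, 89, 89],
    "duration": [10, 9, 10, 11, 6, 8],
    "user_birth_year": [1979, 1985, 1979, 1997, 1986, 1986],
})

# Drop rows that are identical across all columns
ride_dup = rides.drop_duplicates()

# Aggregations used to merge the remaining incomplete duplicates
statistics = {"user_birth_year": "min", "duration": "mean"}

# Collapse each ride_id into a single row
ride_unique = ride_dup.groupby("ride_id").agg(statistics).reset_index()

# Verify that no ride_id is duplicated any more
duplicates = ride_unique.duplicated(subset="ride_id", keep=False)
assert ride_unique[duplicates].shape[0] == 0

print(ride_unique)
```

In this sample, ride 89 ends up with a duration of 7.0, the mean of 6 and 8. Only the columns named in `statistics` survive the aggregation, so any other column you need would have to be added to the dictionary.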





In [None]:
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0