# Common data types

Manipulating and analyzing data stored with incorrect data types can compromise your analysis at every later stage of the data science workflow.

When working with new data, you should always check the data types of your columns using the `.dtypes` attribute or the `.info()` method, which you'll see in the next exercise. Oftentimes, you'll run into columns that should be converted to a different data type before starting any analysis.
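As a minimal sketch of that first check, the snippet below builds a small stand-in DataFrame and inspects its column types both ways. The two rows are made up for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical two-row frame standing in for freshly loaded data
rides = pd.DataFrame({
    "duration": ["12 minutes", "24 minutes"],  # numbers stored as text
    "user_type": [2, 1],                       # category codes stored as integers
})

# One dtype per column
print(rides.dtypes)

# dtypes plus non-null counts and memory usage
rides.info()
```

Here `duration` shows up as a text column and `user_type` as an integer column, which are the two kinds of mismatch the following exercises deal with.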

In this exercise, you'll first identify different types of data and correctly map them to their respective types.

<center><img src="images/01.01.png" style="width: 400px; height: 300px;"/></center>

# Numeric data or ... ?

In this exercise, and throughout this chapter, you'll be working with bicycle ride sharing data in San Francisco called `ride_sharing`. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

The `user_type` column contains information on whether a user is taking a free ride and takes on the following values:

- 1 for free riders.
- 2 for pay per ride.
- 3 for monthly subscribers.

In this exercise, you will print the information of `ride_sharing` using `.info()` and see firsthand how an incorrect data type can distort your analysis of the dataset. The `pandas` package is imported as `pd`.

In [2]:
import pandas as pd

ride_sharing = pd.read_csv("dataset/ride_sharing_new.csv")

# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       25760 non-null  int64 
 1   duration         25760 non-null  object
 2   station_A_id     25760 non-null  int64 
 3   station_A_name   25760 non-null  object
 4   station_B_id     25760 non-null  int64 
 5   station_B_name   25760 non-null  object
 6   bike_id          25760 non-null  int64 
 7   user_type        25760 non-null  int64 
 8   user_birth_year  25760 non-null  int64 
 9   user_gender      25760 non-null  object
dtypes: int64(6), object(4)
memory usage: 2.0+ MB
None
count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64


In [6]:
# Print the information of ride_sharing again (this cell ran after In [5] below, so user_type_cat is listed)
print(ride_sharing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Unnamed: 0       25760 non-null  int64   
 1   duration         25760 non-null  object  
 2   station_A_id     25760 non-null  int64   
 3   station_A_name   25760 non-null  object  
 4   station_B_id     25760 non-null  int64   
 5   station_B_name   25760 non-null  object  
 6   bike_id          25760 non-null  int64   
 7   user_type        25760 non-null  int64   
 8   user_birth_year  25760 non-null  int64   
 9   user_gender      25760 non-null  object  
 10  user_type_cat    25760 non-null  category
dtypes: category(1), int64(6), object(4)
memory usage: 2.0+ MB
None



In [5]:
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())

count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dtype: int64
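
One way to make the categorical version easier to read is to attach the labels from the description of `user_type` before converting. The sketch below uses a short made-up sample of codes rather than the real column; the `labels` dictionary and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up sample of user_type codes (not the real data)
user_type = pd.Series([1, 2, 2, 3, 2], name="user_type")

# Labels taken from the description of user_type
labels = {1: "free rider", 2: "pay per ride", 3: "monthly subscriber"}

# The numeric mean treats the codes as quantities, which they are not
print(user_type.mean())

# Map codes to labels and store them as a category
user_type_cat = user_type.map(labels).astype("category")
assert user_type_cat.dtype == "category"

# Counts per group describe the distribution directly
print(user_type_cat.value_counts())
```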


# Summing strings and concatenating numbers

In the previous exercise, you identified `category` as the correct data type for `user_type` and converted the column so that its summary statistics describe the distribution of `user_type`.

Another common data type problem is importing values that should be numeric as strings: operations such as summation and multiplication then concatenate or repeat the strings instead of producing numeric results.
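
A small illustration of this pitfall, using three made-up duration values stored as strings (assumed here, not read from the dataset):

```python
import pandas as pd

# Made-up durations stored as strings (object dtype)
durations = pd.Series(["12", "24", "8"], dtype=object)

# Summing strings joins them end to end
print(durations.sum())

# Multiplying strings repeats them
print((durations * 2).tolist())

# After conversion, the same operation is arithmetic
print(durations.astype("int").sum())
```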

In this exercise, you'll convert the string column `duration` to the type `int`. Before that, however, you will need to strip `"minutes"` from the column so that `pandas` can read it as numeric. The `pandas` package has been imported as `pd`.

In [8]:
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip("minutes")

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print the new columns and calculate the average ride duration
# (only the numeric column is averaged; recent pandas versions raise an error on string columns)
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing[['duration_time']].mean())

         duration duration_trim  duration_time
0      12 minutes           12              12
1      24 minutes           24              24
2       8 minutes            8               8
3       4 minutes            4               4
4      11 minutes           11              11
...           ...           ...            ...
25755  11 minutes           11              11
25756  10 minutes           10              10
25757  14 minutes           14              14
25758  14 minutes           14              14
25759  29 minutes           29              29

[25760 rows x 3 columns]
duration_time    11.389053
dtype: float64
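
Note that `.str.strip("minutes")` treats its argument as a set of characters to remove from both ends, not as a literal word, so the space before "minutes" stays in place (the integer conversion in the cell above tolerated it). If you prefer to remove the exact suffix, `.str.replace()` with `regex=False` is one alternative. The sample values below are made up for illustration.

```python
import pandas as pd

# Made-up sample in the same format as the duration column
duration = pd.Series(["12 minutes", "8 minutes"])

# strip() removes any of the characters m, i, n, u, t, e, s from both ends
stripped = duration.str.strip("minutes")
print(stripped.tolist())  # the trailing space is still there

# replace() removes the literal text, including the space
replaced = duration.str.replace(" minutes", "", regex=False)
cleaned = replaced.astype("int")
print(cleaned.tolist())
```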



In [10]:
ride_sharing.head(2)

| | Unnamed: 0 | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender | user_type_cat | duration_trim | duration_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 12 minutes | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | 2 | 1959 | Male | 2 | 12 | 12 |
| 1 | 1 | 24 minutes | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | 2 | 1965 | Male | 2 | 24 | 24 |


# Tire size constraints

In this lesson, you're going to build on top of the work you've been doing with the `ride_sharing` DataFrame. You'll be working with the `tire_sizes` column which contains data on each bike's tire size.

Bicycle tire sizes could be either 26″, 27″ or 29″ and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.

In this exercise, you will make sure the `tire_sizes` column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes.

In [12]:
ride_sharing.columns

Index(['Unnamed: 0', 'duration', 'station_A_id', 'station_A_name',
       'station_B_id', 'station_B_name', 'bike_id', 'user_type',
       'user_birth_year', 'user_gender', 'user_type_cat', 'duration_trim',
       'duration_time'],
      dtype='object')

In [13]:
# # Convert tire_sizes to integer
# ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# # Set all tire sizes above 27 to 27 (select only the tire_sizes column, not whole rows)
# ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# # Reconvert tire_sizes back to categorical
# ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# # Print tire size description
# print(ride_sharing['tire_sizes'].describe())
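
The column listing above shows no `tire_sizes` column in this copy of the dataset, which is why the cell is commented out. The same steps can be tried on a small stand-in frame; the `bikes` DataFrame and its values below are hypothetical.

```python
import pandas as pd

# Hypothetical bikes with tire sizes stored as a category
bikes = pd.DataFrame({
    "bike_id": [1, 2, 3, 4],
    "tire_sizes": pd.Categorical([26, 27, 29, 29]),
})

# Convert to integer so that comparisons are possible
bikes["tire_sizes"] = bikes["tire_sizes"].astype("int")

# Cap values above 27, touching only the tire_sizes column
bikes.loc[bikes["tire_sizes"] > 27, "tire_sizes"] = 27

# Convert back to a category and summarize
bikes["tire_sizes"] = bikes["tire_sizes"].astype("category")
print(bikes["tire_sizes"].describe())
```

Selecting the column by name in `.loc` matters: with `:` in its place, every column of the matching rows would be overwritten with 27.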

# Back to the future

A new update to the data pipeline feeding into the `ride_sharing` DataFrame registers each ride's date. This information is stored in the `ride_date` column, which has the type `object`, the type `pandas` uses to represent strings.

A bug was discovered that recorded rides taken today as taken next year. To fix this, you will find all values in the `ride_date` column that fall in the future and set them to today's date, so that today becomes the column's maximum possible value. Before doing so, you will need to convert `ride_date` to a `datetime` object.

The `datetime` package has been imported as `dt`, alongside all the packages you've been using so far.

In [16]:
# import datetime as dt
# # Convert ride_date to date
# ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# # Save today's date
# today = dt.date.today()

# # Set all in the future to today's date
# ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# # Print maximum of ride_dt column
# print(ride_sharing['ride_dt'].max())
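
The `ride_date` column is not among the columns listed earlier for this copy of the data, so here is a minimal sketch of the same capping logic on two made-up dates, one in the past and one a year ahead. The `rides` frame is an assumption for illustration.

```python
import datetime as dt
import pandas as pd

# Save today's date once so every comparison uses the same value
today = dt.date.today()

# Made-up ride dates stored as strings: one past, one in the future
rides = pd.DataFrame({
    "ride_date": [
        (today - dt.timedelta(days=3)).isoformat(),
        (today + dt.timedelta(days=365)).isoformat(),
    ]
})

# Parse the strings, then keep only the date part
rides["ride_dt"] = pd.to_datetime(rides["ride_date"]).dt.date

# Replace any future date with today's date
rides.loc[rides["ride_dt"] > today, "ride_dt"] = today

# The latest date can no longer be after today
print(rides["ride_dt"].max())
```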

# How big is your subset?

You have the following `loans` DataFrame which contains loan and credit score data for consumers, and some metadata such as their first and last names. You want to find both complete and incomplete duplicates using `.duplicated()`.

<center><img src="images/01.02.png" style="width: 400px; height: 300px;"/></center>

Choose the correct usage of `.duplicated()` below:

- `loans.duplicated(subset = ['first_name', 'last_name'], keep = False)`
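
To see what `keep = False` changes, the sketch below runs `.duplicated()` on a tiny made-up `loans` frame (the names and scores are invented). With the default `keep = 'first'`, the first occurrence is not flagged; with `keep = False`, every row sharing the same name is flagged, so rows that match on name but differ elsewhere are all visible.

```python
import pandas as pd

# Invented sample: the first two rows share a name but not a credit score
loans = pd.DataFrame({
    "first_name": ["Ann", "Ann", "Bob"],
    "last_name": ["Lee", "Lee", "Ray"],
    "credit_score": [700, 650, 720],
})

# Default keep='first': only the later repeats are marked
first_only = loans.duplicated(subset=["first_name", "last_name"])
print(first_only.tolist())

# keep=False: every member of a duplicated group is marked
all_copies = loans.duplicated(subset=["first_name", "last_name"], keep=False)
print(loans[all_copies])
```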


# Finding duplicates

A new update to the data pipeline feeding into `ride_sharing` has added the `ride_id` column, which represents a unique identifier for each ride.

The update, however, coincided with radically shorter average ride durations and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the `ride_sharing` DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates. A sample of `ride_sharing` is in your environment, as well as all the packages you've been working with thus far.

In [18]:
# # Find duplicates
# duplicates = ride_sharing.duplicated(['ride_id'], keep = False)

# # Sort your duplicated rides
# duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# # Print relevant columns of duplicated_rides
# print(duplicated_rides[['ride_id','duration','user_birth_year']])
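
As a sketch of how complete and incomplete duplicates can be told apart, the example below uses a made-up sample with a `ride_id` column (all values are invented for illustration). Rows that repeat across every column are complete duplicates; rows that share a `ride_id` but differ elsewhere are incomplete ones.

```python
import pandas as pd

# Invented sample: ride 33 is repeated exactly, ride 55 repeats with different values
rides = pd.DataFrame({
    "ride_id": [33, 33, 55, 55, 71],
    "duration": [10, 10, 9, 11, 12],
    "user_birth_year": [1979, 1979, 1985, 1986, 1997],
})

# Rows whose ride_id appears more than once
same_id = rides.duplicated(subset="ride_id", keep=False)

# Rows that are identical across all columns
complete = rides.duplicated(keep=False)

# Same ride_id, but not identical everywhere
incomplete = same_id & ~complete

print(rides[same_id].sort_values("ride_id"))
print(rides[incomplete])
```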

# Treating duplicates

In the last exercise, you were able to verify that the new update feeding into `ride_sharing` contains a bug generating both complete and incomplete duplicated rows for some values of the `ride_id` column, with occasional discrepant values for the `user_birth_year` and `duration` columns.

In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average `duration`, and the minimum `user_birth_year` for each set of incomplete duplicate rows.

In [19]:
# # Drop complete duplicates from ride_sharing
# ride_dup = ride_sharing.drop_duplicates()

# # Create statistics dictionary for aggregation function
# statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# # Group by ride_id and compute new statistics
# ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# # Find duplicated values again
# duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
# duplicated_rides = ride_unique[duplicates]

# # Assert duplicates are processed
# assert duplicated_rides.shape[0] == 0
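
The treatment steps can be tried on a small made-up sample. The sketch below assumes a `ride_id` column and a `duration` column that is already numeric; all values are invented for illustration.

```python
import pandas as pd

# Invented sample: ride 33 is repeated exactly, ride 55 repeats with different values
rides = pd.DataFrame({
    "ride_id": [33, 33, 55, 55, 71],
    "duration": [10, 10, 9, 11, 12],
    "user_birth_year": [1979, 1979, 1985, 1986, 1997],
})

# Drop rows that are identical across all columns
ride_dup = rides.drop_duplicates()

# Collapse the remaining rows per ride_id: earliest birth year, average duration
statistics = {"user_birth_year": "min", "duration": "mean"}
ride_unique = ride_dup.groupby("ride_id").agg(statistics).reset_index()

# No ride_id should appear more than once now
assert not ride_unique.duplicated(subset="ride_id", keep=False).any()
print(ride_unique)
```

Dropping complete duplicates first keeps an exact repeat from being counted twice in the per-ride average.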