# Introduction to Exploratory Data Analysis
Using Pandas!

This notebook accompanies a tutorial on the [Medallion Data Science YouTube channel](https://www.youtube.com/channel/UCxladMszXan-jfgzyeIMyvw). Click the link and subscribe for future tutorials.

# Step 0: Imports and Reading Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
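# Use the full option name; the short form 'max_columns' is rejected by newer pandas versions
pd.set_option('display.max_columns', 200)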

In [2]:
df = pd.read_csv('../input/rollercoaster-database/coaster_db.csv')

# Step 1: Data Understanding
- Dataframe `shape`
- `head` and `tail`
- `dtypes`
- `describe`


In [3]:
df.shape

(1087, 56)

In [4]:
df.head(5)

Unnamed: 0,coaster_name,Length,Speed,Location,Status,Opening date,Type,Manufacturer,Height restriction,Model,Height,Inversions,Lift/launch system,Cost,Trains,Park section,Duration,Capacity,G-force,Designer,Max vertical angle,Drop,Soft opening date,Fast Lane available,Replaced,Track layout,Fastrack available,Soft opening date.1,Closing date,Opened,Replaced by,Website,Flash Pass Available,Must transfer from wheelchair,Theme,Single rider line available,Restraint Style,Flash Pass available,Acceleration,Restraints,Name,year_introduced,latitude,longitude,Type_Main,opening_date_clean,speed1,speed2,speed1_value,speed1_unit,speed_mph,height_value,height_unit,height_ft,Inversions_clean,Gforce_clean
0,Switchback Railway,600 ft (180 m),6 mph (9.7 km/h),Coney Island,Removed,"June 16, 1884",Wood,LaMarcus Adna Thompson,,Lift Packed,50 ft (15 m),,gravity,,,Coney Island Cyclone Site,1:00,1600 riders per hour,2.9,LaMarcus Adna Thompson,30°,43 ft (13 m),,,,Gravity pulled coaster,,,,,,,,,,,,,,,,1884,40.574,-73.978,Wood,1884-06-16,6 mph,9.7 km/h,6.0,mph,6.0,50.0,ft,,0,2.9
1,Flip Flap Railway,,,Sea Lion Park,Removed,1895,Wood,Lina Beecher,,,,1.0,,,a single car. Riders are arranged 1 across in ...,,,,12.0,Lina Beecher,,,,,,,,,1902.0,,,,,,,,,,,,,1895,40.578,-73.979,Wood,1895-01-01,,,,,,,,,1,12.0
2,Switchback Railway (Euclid Beach Park),,,"Cleveland, Ohio, United States",Closed,,Other,,,,,,,,,,,,,,,,,,,,,,,1895.0,,,,,,,,,,,,1896,41.58,-81.57,Other,,,,,,,,,,0,
3,Loop the Loop (Coney Island),,,Other,Removed,1901,Steel,Edwin Prescott,,,,1.0,,,a single car. Riders are arranged 2 across in ...,,,,,Edward A. Green,,,,,Switchback Railway,,,,1910.0,,Giant Racer,,,,,,,,,,,1901,40.5745,-73.978,Steel,1901-01-01,,,,,,,,,1,
4,Loop the Loop (Young's Pier),,,Other,Removed,1901,Steel,Edwin Prescott,,,,1.0,,,,,,,,Edward A. Green,,,,,,,,,1912.0,,,,,,,,,,,,,1901,39.3538,-74.4342,Steel,1901-01-01,,,,,,,,,1,


In [9]:
df.drop(['Status'], axis=1)

Unnamed: 0,coaster_name,Length,Speed,Location,Opening date,Type,Manufacturer,Height restriction,Model,Height,Inversions,Lift/launch system,Cost,Trains,Park section,Duration,Capacity,G-force,Designer,Max vertical angle,Drop,Soft opening date,Fast Lane available,Replaced,Track layout,Fastrack available,Soft opening date.1,Closing date,Opened,Replaced by,Website,Flash Pass Available,Must transfer from wheelchair,Theme,Single rider line available,Restraint Style,Flash Pass available,Acceleration,Restraints,Name,year_introduced,latitude,longitude,Type_Main,opening_date_clean,speed1,speed2,speed1_value,speed1_unit,speed_mph,height_value,height_unit,height_ft,Inversions_clean,Gforce_clean
0,Switchback Railway,600 ft (180 m),6 mph (9.7 km/h),Coney Island,"June 16, 1884",Wood,LaMarcus Adna Thompson,,Lift Packed,50 ft (15 m),,gravity,,,Coney Island Cyclone Site,1:00,1600 riders per hour,2.9,LaMarcus Adna Thompson,30°,43 ft (13 m),,,,Gravity pulled coaster,,,,,,,,,,,,,,,,1884,40.5740,-73.9780,Wood,1884-06-16,6 mph,9.7 km/h,6.0,mph,6.0,50.0,ft,,0,2.9
1,Flip Flap Railway,,,Sea Lion Park,1895,Wood,Lina Beecher,,,,1.0,,,a single car. Riders are arranged 1 across in ...,,,,12,Lina Beecher,,,,,,,,,1902,,,,,,,,,,,,,1895,40.5780,-73.9790,Wood,1895-01-01,,,,,,,,,1,12.0
2,Switchback Railway (Euclid Beach Park),,,"Cleveland, Ohio, United States",,Other,,,,,,,,,,,,,,,,,,,,,,,1895,,,,,,,,,,,,1896,41.5800,-81.5700,Other,,,,,,,,,,0,
3,Loop the Loop (Coney Island),,,Other,1901,Steel,Edwin Prescott,,,,1.0,,,a single car. Riders are arranged 2 across in ...,,,,,Edward A. Green,,,,,Switchback Railway,,,,1910,,Giant Racer,,,,,,,,,,,1901,40.5745,-73.9780,Steel,1901-01-01,,,,,,,,,1,
4,Loop the Loop (Young's Pier),,,Other,1901,Steel,Edwin Prescott,,,,1.0,,,,,,,,Edward A. Green,,,,,,,,,1912,,,,,,,,,,,,,1901,39.3538,-74.4342,Steel,1901-01-01,,,,,,,,,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1082,American Dreier Looping,"3,444 ft (1,050 m)",53 mph (85 km/h),Other,,Steel,Anton Schwarzkopf,55 in (140 cm),,111 ft (34 m),3.0,Booster Wheel Lift Hill,,3 trains with 5 cars. Riders are arranged 2 ac...,,,"1,500 riders per hour",4.7,Werner Stengel,,,,,,,,,,,,,,,,,,,,,,2022,,,Steel,,53 mph,85 km/h,53.0,mph,53.0,111.0,ft,,3,4.7
1083,Pantheon (roller coaster),"3,328 ft (1,014 m)",73 mph (117 km/h),Busch Gardens Williamsburg,2022,Steel – Launched,Intamin,,Blitz Coaster,178 ft (54 m),2.0,LSM,,2 trains with 5 cars. Riders are arranged 2 ac...,,,,,,95°,,,,,,,,,,,,,,,,,,,,,2022,37.2339,-76.6426,Steel,2022-01-01,73 mph,117 km/h,73.0,mph,73.0,178.0,ft,,2,
1084,Tron Lightcycle Power Run,"3,169.3 ft (966.0 m)",59.3[1] mph (95.4 km/h),Other,"June 16, 2016",Steel – Launched,Vekoma,4[2] ft (122 cm),Motorbike roller coaster,78.1 ft (23.8 m),0.0,LIM Launch,,7 trains with 7 cars. Riders are arranged 2 ac...,,~2:00,"1,680 riders per hour",4,Walt Disney Imagineering,,,,,,,,,,,,,,Must transfer from wheelchair,Tron,Single rider line available,,,,,TRON Lightcycle / Run,2022,,,Steel,2016-06-16,59.3 mph,95.4 km/h,59.3,mph,59.3,78.1,ft,,0,4.0
1085,Tumbili,770 ft (230 m),34 mph (55 km/h),Kings Dominion,,Steel – 4th Dimension – Wing Coaster,S&S – Sansei Technologies,,4D Free Spin,112 ft (34 m),0.0,Vertical chain lift hill,,Single car trains with riders arranged 4 acros...,Jungle X-Pedition,0:55,,,,,,,,The Crypt,,,,,,,Official website,,,,,,,,,,2022,,,Steel,,34 mph,55 km/h,34.0,mph,34.0,112.0,ft,,0,
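
Note that `df.drop` returns a new DataFrame and leaves `df` untouched, which is why `Status` is still available in later cells. A minimal sketch of that behavior, using a small hypothetical frame (`toy`) rather than the coaster data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the coaster data
toy = pd.DataFrame({
    "coaster_name": ["A", "B"],
    "Status": ["Removed", "Operating"],
    "speed_mph": [6.0, 53.0],
})

# drop() returns a new DataFrame; `toy` itself is unchanged
dropped = toy.drop(["Status"], axis=1)
print(list(dropped.columns))  # ['coaster_name', 'speed_mph']
print(list(toy.columns))      # ['coaster_name', 'Status', 'speed_mph']

# To keep the result, assign it back (columns= is equivalent to axis=1)
kept = toy.drop(columns=["Status"])
```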


In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.describe()

# Step 2: Data Preparation
- Dropping irrelevant columns and rows
- Identifying duplicated columns
- Renaming columns
- Feature creation

In [None]:
# Example of dropping columns
# df.drop(['Opening date'], axis=1)

In [None]:
# Keep only the columns used in the analysis; dropped columns are left commented out
df = df[['coaster_name',
         # 'Length', 'Speed',
         'Location', 'Status',
         # 'Opening date',
         # 'Type',
         'Manufacturer',
         # 'Height restriction', 'Model', 'Height',
         # 'Inversions', 'Lift/launch system', 'Cost', 'Trains', 'Park section',
         # 'Duration', 'Capacity', 'G-force', 'Designer', 'Max vertical angle',
         # 'Drop', 'Soft opening date', 'Fast Lane available', 'Replaced',
         # 'Track layout', 'Fastrack available', 'Soft opening date.1',
         # 'Closing date',
         # 'Opened',
         # 'Replaced by', 'Website',
         # 'Flash Pass Available', 'Must transfer from wheelchair', 'Theme',
         # 'Single rider line available', 'Restraint Style',
         # 'Flash Pass available', 'Acceleration', 'Restraints', 'Name',
         'year_introduced',
         'latitude', 'longitude',
         'Type_Main',
         'opening_date_clean',
         # 'speed1', 'speed2', 'speed1_value', 'speed1_unit',
         'speed_mph',
         # 'height_value', 'height_unit',
         'height_ft',
         'Inversions_clean', 'Gforce_clean']].copy()

In [None]:
df['opening_date_clean'] = pd.to_datetime(df['opening_date_clean'])
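
As a rough illustration of what `pd.to_datetime` does here, the sketch below parses a few ISO-style strings like those in `opening_date_clean`. The series `raw` and `messy` are made up for the example, and `errors="coerce"` is an optional extra that the cell above does not use.

```python
import pandas as pd

# Hypothetical values in the same YYYY-MM-DD format as opening_date_clean
raw = pd.Series(["1884-06-16", "1895-01-01", None])
parsed = pd.to_datetime(raw)

# Missing inputs become NaT (the datetime equivalent of NaN)
print(parsed.isna().tolist())  # [False, False, True]

# Once parsed, date parts are available through the .dt accessor
print(parsed.dt.year)

# Assumption for illustration: unparseable strings can be turned into NaT
# instead of raising by passing errors="coerce"
messy = pd.Series(["1884-06-16", "not a date"])
coerced = pd.to_datetime(messy, errors="coerce")
print(coerced.isna().tolist())  # [False, True]
```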

In [None]:
# Rename our columns
df = df.rename(columns={'coaster_name':'Coaster_Name',
                   'year_introduced':'Year_Introduced',
                   'opening_date_clean':'Opening_Date',
                   'speed_mph':'Speed_mph',
                   'height_ft':'Height_ft',
                   'Inversions_clean':'Inversions',
                   'Gforce_clean':'Gforce'})

In [None]:
df.isna().sum()
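
Raw counts of missing values are easier to judge next to the share of rows they represent. The sketch below shows both on a hypothetical frame (`toy`); taking the mean of the boolean mask gives the fraction missing per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values
toy = pd.DataFrame({
    "Speed_mph": [6.0, np.nan, 53.0, np.nan],
    "Height_ft": [50.0, 111.0, np.nan, 78.1],
})

missing_count = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()  # fraction of rows missing per column

print(missing_count.to_dict())  # {'Speed_mph': 2, 'Height_ft': 1}
print(missing_share.to_dict())  # {'Speed_mph': 0.5, 'Height_ft': 0.25}
```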

In [None]:
df.loc[df.duplicated()]

In [None]:
# Check for duplicated coaster names
df.loc[df.duplicated(subset=['Coaster_Name'])].head(5)

In [None]:
# Checking an example duplicate
df.query('Coaster_Name == "Crystal Beach Cyclone"')

In [None]:
df.columns

In [None]:
df = df.loc[~df.duplicated(subset=['Coaster_Name','Location','Opening_Date'])] \
    .reset_index(drop=True).copy()
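
To see how the `subset` argument decides which rows count as duplicates, here is a small sketch on made-up rows (the frame `toy` and its values are hypothetical). It also shows that `drop_duplicates` with the same subset should give the same result as the boolean-mask approach above.

```python
import pandas as pd

# Hypothetical rows: the first two share name, location, and opening date
toy = pd.DataFrame({
    "Coaster_Name": ["Cyclone", "Cyclone", "Cyclone", "Comet"],
    "Location": ["Park A", "Park A", "Park B", "Park A"],
    "Opening_Date": pd.to_datetime(
        ["1927-01-01", "1927-01-01", "1930-01-01", "1946-01-01"]
    ),
    "Year_Introduced": [1926, 1927, 1930, 1946],
})

key_cols = ["Coaster_Name", "Location", "Opening_Date"]

# duplicated() marks every repeat after the first occurrence (keep='first')
mask = toy.duplicated(subset=key_cols)
print(mask.tolist())  # [False, True, False, False]

# Same pattern as the cell above: keep rows that are not flagged
deduped = toy.loc[~mask].reset_index(drop=True)

# Equivalent shortcut
same = toy.drop_duplicates(subset=key_cols).reset_index(drop=True)
print(deduped.equals(same))  # True
```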

# Step 3: Feature Understanding
(Univariate analysis)

- Plotting Feature Distributions
    - Histogram
    - KDE
    - Boxplot

In [None]:
df['Year_Introduced'].value_counts()

In [None]:
ax = df['Year_Introduced'].value_counts() \
    .head(10) \
    .plot(kind='bar', title='Top 10 Years Coasters Introduced')
ax.set_xlabel('Year Introduced')
ax.set_ylabel('Count')

In [None]:
ax = df['Speed_mph'].plot(kind='hist',
                          bins=20,
                          title='Coaster Speed (mph)')
ax.set_xlabel('Speed (mph)')

In [None]:
ax = df['Speed_mph'].plot(kind='kde',
                          title='Coaster Speed (mph)')
ax.set_xlabel('Speed (mph)')
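
Step 3 lists a boxplot alongside the histogram and KDE, but no cell draws one. A minimal sketch of what it could look like, using a handful of made-up speeds (`speeds`) in place of `df['Speed_mph']`:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical speeds standing in for df['Speed_mph']
speeds = pd.Series([6.0, 34.0, 53.0, 59.3, 73.0, 120.0], name="Speed_mph")

# The box spans the quartiles; the line inside the box is the median
quartiles = speeds.quantile([0.25, 0.5, 0.75])
print(quartiles)

ax = speeds.plot(kind="box", title="Coaster Speed (mph)")
ax.set_ylabel("Speed (mph)")
```

In a notebook the figure renders inline; in a script, call `plt.show()` afterwards.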

In [None]:
df['Type_Main'].value_counts()

# Step 4: Feature Relationships
- Scatterplot
- Heatmap Correlation
- Pairplot
- Groupby comparisons

In [None]:
df.plot(kind='scatter',
        x='Speed_mph',
        y='Height_ft',
        title='Coaster Speed vs. Height')
plt.show()

In [None]:
ax = sns.scatterplot(x='Speed_mph',
                y='Height_ft',
                hue='Year_Introduced',
                data=df)
ax.set_title('Coaster Speed vs. Height')
plt.show()

In [None]:
sns.pairplot(df,
             vars=['Year_Introduced','Speed_mph',
                   'Height_ft','Inversions','Gforce'],
            hue='Type_Main')
plt.show()

In [None]:
df_corr = df[['Year_Introduced','Speed_mph',
    'Height_ft','Inversions','Gforce']].dropna().corr()
df_corr
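
Calling `.dropna()` before `.corr()` removes every row that is missing any of the selected columns, so all correlations are computed on the same, smaller set of rows. Without it, `.corr()` uses whatever rows are complete for each pair of columns. The sketch below, on a hypothetical frame (`toy`), shows how the row counts differ.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where only Gforce has missing values
toy = pd.DataFrame({
    "Speed_mph": [6.0, 34.0, 53.0, 73.0, 59.3],
    "Height_ft": [50.0, 112.0, 111.0, 178.0, 78.1],
    "Gforce": [2.9, np.nan, 4.7, np.nan, 4.0],
})

# Listwise deletion: only rows complete in ALL columns are used
listwise = toy.dropna().corr()
n_listwise = len(toy.dropna())

# Pairwise deletion: each pair uses the rows complete for that pair
pairwise = toy.corr()
n_pairwise = len(toy[["Speed_mph", "Height_ft"]].dropna())

print(n_listwise, n_pairwise)  # 3 5
print(listwise.loc["Speed_mph", "Height_ft"])
print(pairwise.loc["Speed_mph", "Height_ft"])
```

Either choice can be reasonable; the point is to know how many rows each correlation is based on.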

In [None]:
sns.heatmap(df_corr, annot=True)

# Step 5: Ask a Question about the data
- Try to answer a question you have about the data using a plot or statistic.

Which locations have the fastest roller coasters on average, counting only locations with at least 10 coasters that have a recorded speed?

In [None]:
ax = df.query('Location != "Other"') \
    .groupby('Location')['Speed_mph'] \
    .agg(['mean','count']) \
    .query('count >= 10') \
    .sort_values('mean')['mean'] \
    .plot(kind='barh', figsize=(12, 5), title='Average Coaster Speed by Location')
ax.set_xlabel('Average Coaster Speed (mph)')
plt.show()
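
The chain above can be easier to follow on a tiny example. The sketch below applies the same filter, group, aggregate, and threshold steps to a hypothetical frame (`toy`), with the minimum lowered to 2 so the toy data has something to show. Note that `count` only counts non-missing speeds, so a location needs enough coasters with a recorded speed to pass the filter.

```python
import pandas as pd

# Hypothetical data: Park B has two coasters but only one recorded speed
toy = pd.DataFrame({
    "Location": ["Park A", "Park A", "Park A", "Park B", "Park B", "Other"],
    "Speed_mph": [50.0, 60.0, 70.0, 80.0, None, 90.0],
})

summary = (
    toy.query('Location != "Other"')     # drop the catch-all location
       .groupby("Location")["Speed_mph"]
       .agg(["mean", "count"])           # count ignores missing speeds
       .query("count >= 2")              # threshold lowered for the toy data
       .sort_values("mean")
)
print(summary)
```

Only Park A survives: Park B's single recorded speed falls below the threshold even though it has two rows.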