In [1]:
#Import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

In [2]:
#creating path to import the files
path = r'C:\Users\rbrand\New York Citi Bikes'

In [4]:
#Importing cb_clean_data.csv data set from prepared data.
cb_clean_data = pd.read_csv(os.path.join(path, 'Data', 'Prepared Data', 'cb_clean_data.csv'), index_col=False)

We will build a basic understanding of the data through the following steps:
    1. Variable definitions: Understand the meaning of each variable in the dataset, including its type (numerical, categorical, datetime), its values, and any units of measurement.
    2. Data structure: Check the shape of the dataset (number of rows and columns) and preview the first few rows to understand the structure and format of the data.
    3. Descriptive statistics: Compute summary statistics for numerical variables (mean, median, standard deviation, min, max) to understand their distributions. For categorical variables, count the frequency of each category.
    4. Visualizations: Create a few visualizations to explore the distribution of numerical and categorical variables.

In [7]:
#1.Reviewing the Variable Definitions
# You can print the column names and their data types
print("Variable Definitions:")
print(cb_clean_data.dtypes)

Variable Definitions:
Unnamed: 0                   int64
trip_id                     object
bike_id                      int64
weekday                     object
start_hour                   int64
start_time                  object
start_station_id             int64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_time                    object
end_station_id               int64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
trip_duration                int64
subscriber                  object
birth_year                 float64
gender                       int64
trip_duration_minutes      float64
dtype: object


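These variable definitions give an overview of the columns in the dataset and their data types. Note that start_time and end_time are stored as object (string) columns rather than datetimes, and that Unnamed: 0 appears to be a leftover row index. The dataframe meets the minimum requirements to proceed with further analysis.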
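
Since start_time and end_time show up as object columns, a later step may want to parse them into datetimes. The following is a minimal sketch on a tiny stand-in frame built from timestamps in the preview rows; the format string is an assumption inferred from values such as "9/9/2013 18:18" and should be checked against the full file.

```python
import pandas as pd

# Tiny stand-in for the real data, using timestamps copied from the preview rows
sample = pd.DataFrame({
    "start_time": ["9/9/2013 18:18", "9/12/2013 18:38", "9/28/2013 11:54"],
})

# Assumed format: month/day/year hour:minute
sample["start_time"] = pd.to_datetime(sample["start_time"], format="%m/%d/%Y %H:%M")

# Once parsed, hour and weekday can be derived instead of stored separately
start_hours = sample["start_time"].dt.hour.tolist()
weekdays = sample["start_time"].dt.day_name().str[:3].tolist()

print(sample.dtypes)
print(start_hours)
print(weekdays)
```

The derived hours and weekday abbreviations can then be compared against the existing start_hour and weekday columns as a consistency check.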

In [8]:
#2.Exploring the data structure
print("\nDataset Shape:", cb_clean_data.shape)
print("\nFirst Few Rows:")
print(cb_clean_data.head())


Dataset Shape: (50000, 20)

First Few Rows:
   Unnamed: 0 trip_id  bike_id weekday  start_hour       start_time  \
0           0  LnQzQk    16013     Mon          18   9/9/2013 18:18   
1           1  IL9boN    15230     Thu          18  9/12/2013 18:38   
2           2  46clGB    17942     Wed          19  9/18/2013 19:44   
3           3  v7vdFt    19683     Sat          11  9/28/2013 11:54   
4           4  VGBsb5    18024     Sat          18   9/7/2013 18:08   

   start_station_id       start_station_name  start_station_latitude  \
0               523          W 38 St & 8 Ave               40.754666   
1               257  Lispenard St & Broadway               40.719392   
2               479          9 Ave & W 45 St               40.760193   
3               527          E 33 St & 1 Ave               40.743156   
4               521          8 Ave & W 31 St               40.750450   

   start_station_longitude         end_time  end_station_id  \
0               -73.991382   9/9

In [9]:
#3.Descriptive Statistics
print("\nSummary Statistics for Numerical Variables:")
print(cb_clean_data.describe())

print("\nFrequency of Categories in Categorical Variables:")
print(cb_clean_data['weekday'].value_counts())


Summary Statistics for Numerical Variables:
         Unnamed: 0       bike_id    start_hour  start_station_id  \
count  50000.000000  50000.000000  50000.000000      50000.000000   
mean   24999.500000  17615.269360     14.145240        443.321500   
std    14433.901067   1675.407446      4.860541        356.559925   
min        0.000000  14556.000000      0.000000         72.000000   
25%    12499.750000  16188.000000     10.000000        304.000000   
50%    24999.500000  17584.000000     15.000000        402.000000   
75%    37499.250000  19014.000000     18.000000        484.000000   
max    49999.000000  20642.000000     23.000000       3002.000000   

       start_station_latitude  start_station_longitude  end_station_id  \
count            50000.000000             50000.000000    50000.000000   
mean                40.734170               -73.991109      442.539700   
std                  0.019911                 0.012555      355.756022   
min                 40.680342        

- Trip Duration: The mean trip duration is approximately 839 seconds (about 13.98 minutes), with a standard deviation of approximately 574 seconds. The shortest trip duration is 60 seconds, while the longest is 2697 seconds. There seems to be some variability in trip durations, with a wide range of values.

- Start and End Stations: Start station IDs range from 72 to 3002. Because station IDs are identifiers rather than measurements, their mean and standard deviation are not meaningful; how often each station is used is better assessed with frequency counts.

- Start Hour: The mean start hour is approximately 14.1 (just after 2 PM) and the median is 15, so trips tend to start in the afternoon. Half of all trips start between hours 10 and 18, and the standard deviation of about 4.9 hours indicates considerable spread across the day.

- User Birth Year: The mean birth year of users is around 1976. The minimum birth year is 1899 and the maximum is 1997. The minimum is implausible and points to outliers or data-entry errors, so the 1899 values need to be cleaned.

- Gender: The gender column shows that most users are coded as 1, with a smaller number coded as 2. This suggests binary gender categories; any other code (such as 0) would most likely mark an unknown or unrecorded value rather than a third category.

- Trip Duration in Minutes: After converting trip durations to minutes, the mean trip duration is approximately 13.98 minutes, with a standard deviation of approximately 9.56 minutes.
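
The frequency count used for weekday can be applied to the gender codes in the same way. This is a minimal sketch on a hypothetical sample; the label mapping (0 = unknown, 1 = male, 2 = female) is an assumption and should be verified against the data dictionary for this dataset.

```python
import pandas as pd

# Hypothetical sample of gender codes; the real column is cb_clean_data['gender']
gender = pd.Series([1, 1, 2, 0, 1, 2, 1], name="gender")

# Assumed label mapping; verify against the dataset's documentation
labels = {0: "unknown", 1: "male", 2: "female"}
labelled = gender.map(labels)

# Absolute counts and relative shares per category
counts = labelled.value_counts()
shares = labelled.value_counts(normalize=True).round(3)

print(counts)
print(shares)
```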

Potential Areas to Tackle:

- Outliers in Trip Duration: There may be some outliers in trip duration that could skew the analysis. These outliers could be investigated further to determine if they are valid data points or if they should be treated as errors.

- Handling Missing or Incorrect Birth Years: The minimum birth year of 1899 seems unlikely and may indicate missing or incorrect data. Further investigation is needed to handle such cases, such as imputation or data validation.

- Start and End Stations Analysis: Further analysis of the start and end stations' usage patterns could provide insights into popular routes or areas, potentially leading to improvements in bike availability or infrastructure.

These are some potential areas to explore and tackle based on the descriptive statistics provided. I'm going to proceed with further analysis and a second data cleaning to ensure the accuracy and reliability of the dataset for subsequent analyses.
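
One common way to flag candidate outliers in trip duration is the 1.5 x IQR rule. The sketch below uses hypothetical durations in seconds; the rule is only one option and is not necessarily the method applied later in this notebook.

```python
import pandas as pd

# Hypothetical trip durations in seconds; the real column is cb_clean_data['trip_duration']
durations = pd.Series([60, 300, 420, 600, 780, 900, 1200, 2697])

# Interquartile range and the conventional 1.5 * IQR fences
q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not dropped automatically
outliers = durations[(durations < lower) | (durations > upper)]
print(outliers.tolist())
```

Flagged values still need a judgment call: a long ride can be a valid trip rather than an error.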

In [23]:
#Calculating the median birth year
median_birth_year = cb_clean_data['birth_year'].median()
print("Median birth year:", median_birth_year)

Median birth year: 1978.0


In [27]:
# Replacing incorrect birth years of 1899 with NaN
cb_clean_data['birth_year'] = cb_clean_data['birth_year'].replace(1899, np.nan)

In [28]:
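# Replacing remaining NaN values with the median birth year.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original dataframe.
cb_clean_data['birth_year'] = cb_clean_data['birth_year'].fillna(median_birth_year)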

In [29]:
# Verify the changes
print(cb_clean_data['birth_year'].describe())

count    50000.000000
mean      1975.971540
std         10.272815
min       1900.000000
25%       1970.000000
50%       1978.000000
75%       1983.000000
max       1997.000000
Name: birth_year, dtype: float64


In [35]:
# Displaying the 50 smallest values of the birth_year column, since 1900 now appears as another outlier minimum.
first_50_min_values = cb_clean_data['birth_year'].nsmallest(50)
print(first_50_min_values)


563      1900.0
1371     1900.0
1754     1900.0
7502     1900.0
16252    1900.0
18332    1900.0
28568    1900.0
40359    1900.0
49684    1900.0
1431     1901.0
13081    1901.0
17130    1901.0
44309    1901.0
48989    1901.0
19824    1910.0
27899    1917.0
49185    1921.0
306      1922.0
3548     1922.0
48755    1922.0
40229    1924.0
33470    1926.0
22944    1929.0
513      1932.0
9646     1932.0
12066    1932.0
12309    1932.0
27000    1932.0
28777    1932.0
35908    1932.0
36325    1932.0
43065    1932.0
46465    1932.0
3683     1933.0
7804     1933.0
8796     1933.0
12233    1933.0
47079    1933.0
8998     1934.0
26269    1934.0
2650     1935.0
7180     1935.0
10618    1935.0
11823    1935.0
14243    1935.0
15930    1935.0
18016    1935.0
26899    1935.0
27347    1935.0
29065    1935.0
Name: birth_year, dtype: float64


There are still some inconsistencies in the birth year column. The trips in this dataset are from 2013, so we set the earliest plausible birth year to 1943, assuming that riders are no older than 70 (2013 - 70 = 1943). Any birth year earlier than 1943 is replaced with NaN.

In [36]:
#Setting a threshold for the earliest plausible birth year
earliest_plausible_birth_year = 1943

In [37]:
# Replacing birth years earlier than the threshold with NaN
cb_clean_data.loc[cb_clean_data['birth_year'] < earliest_plausible_birth_year, 'birth_year'] = np.nan

In [38]:
# Replacing NaN values with the median birth year (assigned back instead of using inplace=True)
cb_clean_data['birth_year'] = cb_clean_data['birth_year'].fillna(median_birth_year)

In [39]:
# Verifying the changes
print(cb_clean_data['birth_year'].describe())

count    50000.000000
mean      1976.107060
std          9.984958
min       1943.000000
25%       1970.000000
50%       1978.000000
75%       1983.000000
max       1997.000000
Name: birth_year, dtype: float64
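
The masking and imputation steps above could also be wrapped in a single helper. This is a sketch under stated assumptions: the function name clean_birth_year and its parameters are hypothetical, and unlike the cells above it computes the median from the plausible values only rather than before cleaning.

```python
import numpy as np
import pandas as pd

def clean_birth_year(years, trip_year=2013, max_age=70):
    """Mask implausible birth years and impute with the median of the plausible ones."""
    earliest = trip_year - max_age  # 2013 - 70 = 1943
    # Values below the cutoff (and existing NaN) become NaN
    cleaned = years.where(years >= earliest)
    # Median is taken over the remaining plausible values only
    return cleaned.fillna(cleaned.median())

# Hypothetical sample containing implausible years and a missing value
sample = pd.Series([1899.0, 1900.0, 1943.0, 1978.0, 1985.0, np.nan])
result = clean_birth_year(sample)
print(result.tolist())
```

On the real data this would be called as clean_birth_year(cb_clean_data['birth_year']) and the result assigned back to the column.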


In [40]:
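# Exporting the cleaned dataframe as cb_clean_data2.csv.
# index=False keeps the row index from being written as an extra unnamed column.
cb_clean_data.to_csv(os.path.join(path, 'Data', 'Prepared Data', 'cb_clean_data2.csv'), index=False)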
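
The Unnamed: 0 column seen in the dtypes output is what typically results from writing a dataframe with its index and reading it back. The sketch below reproduces this with an in-memory buffer and a two-row stand-in frame, so it does not touch the real files.

```python
import io
import pandas as pd

# Two-row stand-in using values from the preview rows
df = pd.DataFrame({"trip_id": ["LnQzQk", "IL9boN"], "bike_id": [16013, 15230]})

# Default export writes the row index as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```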