In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Problem in Question: Maximizing Annual Membership**

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are
geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to
any other station in the system anytime.

Cyclistic has two types of customers:
1. Customers with single ride or full-day passes (Casual Riders)
2. Customers with annual memberships (Annual Membership holders)

**Strategy:** Rather than targeting all-new customers with a marketing campaign, Lily Moreno (marketing manager at Cyclistic) believes there is a very good
chance to convert casual riders into members.


**Tasks to meet the desired goal:**
1. Understand how annual members and casual riders differ
2. Determine why casual riders would buy a membership
3. Determine how digital media could affect Cyclistic's marketing tactics

We have Cyclistic bike trip data for the first quarter of 2020 (`/kaggle/input/cyclistic/Divvy_Trips_2020_Q1.csv`) to analyze trends and design the marketing strategy.

**About Data**

For the purpose of this case study, the datasets are appropriate and will enable us to answer the business questions. The data has been made available by Motivate International Inc. 

You can download the data using this link [Cyclistic Data](https://divvy-tripdata.s3.amazonaws.com/index.html)

This is public data that we can use to explore how different customer types are using Cyclistic bikes. But note that data-privacy issues prohibit us from using riders’ personally identifiable information. This means that we won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.

In [None]:
#let's import the libraries required to analyze our dataset
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#importing our dataset and storing it in the variable df_cyclistic 
#analyzing attributes of dataset using head() function
df_cyclistic = pd.read_csv("/kaggle/input/cyclistic/Divvy_Trips_2020_Q1.csv")
df_cyclistic.head()

In [None]:
#get an insight into distribution of data
df_cyclistic.describe()

In [None]:
#getting a superficial view of the data attributes
df_cyclistic.info()

### Preprocessing

**Overall, the data appears clean and the columns are appropriately named, but the datatypes of some columns need to be redefined for calculations to run smoothly.**

**Before continuing with our dataset, let's do some cleaning. Although cleaning can be performed in place as the analysis proceeds, it is advisable to first check for any null entries or anomalies in the dataset.**

In [None]:
#lets see if our dataframe has any null or empty cells to be cleaned
null_val = df_cyclistic[df_cyclistic.isnull().any(axis=1)]
null_val

In [None]:
#the above row has four null values; it is better to remove it than to replace them with other values
#then verify that it has been removed
df_cyclistic=df_cyclistic.dropna()
null_val = df_cyclistic[df_cyclistic.isnull().any(axis=1)]
null_val

**All null entries have been removed.**
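
As a complementary check, a per-column null count shows where the missing values sit before and after `dropna()`. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in, not the Cyclistic data) purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_cyclistic
toy = pd.DataFrame({
    "ride_id": ["a", "b", "c"],
    "end_station_name": ["X", None, "Z"],
    "end_lat": [41.88, np.nan, 41.89],
})

# Number of missing values per column
nulls_before = toy.isnull().sum()
print(nulls_before)

# Drop rows with any missing value and re-check
toy_clean = toy.dropna()
nulls_after = toy_clean.isnull().sum()
print(nulls_after)
```

The same two calls should work on `df_cyclistic` directly, assuming it is loaded as in the cells above.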

In [None]:
#passing a dictionary with the appropriate datatypes for columns 
dat_types = {
    'start_station_name': 'str',
    'end_station_name': 'str',
    'end_station_id': 'int64',
    'member_type' : 'category'
}
#use a separate loop variable so the dictionary name is not overwritten
for col, dtype in dat_types.items():
    df_cyclistic[col] = df_cyclistic[col].astype(dtype)
    

In [None]:
#count the number of distinct values in each column
df_cyclistic.nunique()

In [None]:
#Now, let's calculate the time duration of rides by subtracting started_at from ended_at
#and store it in a new column named ride_length
df_cyclistic['ride_length'] = pd.to_datetime(df_cyclistic['ended_at'], format = '%m/%d/%Y %H:%M')- pd.to_datetime(df_cyclistic['started_at'], format= '%m/%d/%Y %H:%M')
df_cyclistic.head()

In [None]:
#how casual and membership holders use rides
df_new = pd.DataFrame(df_cyclistic.groupby('member_type')['ride_length'].mean()).reset_index()
df_new

**The mean `ride_length` for the two user types shows that the average ride duration of a member is 12 minutes, whereas the average ride of a casual user lasts 1 hr 35 min. This suggests that members are likely daily commuters, while casual riders use the bikes for specific purposes.**
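
Because a handful of very long rides can pull the mean upwards, it may help to look at the median alongside it. The sketch below is an illustration on invented ride lengths (the `toy` frame and its values are assumptions, not results from the dataset); durations are converted to minutes first so the aggregation runs on plain floats.

```python
import pandas as pd

# Hypothetical ride lengths in minutes; one multi-day casual ride dominates the mean
toy = pd.DataFrame({
    "member_type": ["casual", "casual", "casual", "member", "member", "member"],
    "ride_length": pd.to_timedelta([20, 30, 4320, 10, 12, 14], unit="min"),
})

# Convert timedeltas to minutes as floats
toy["ride_minutes"] = toy["ride_length"].dt.total_seconds() / 60

# Mean and median ride length per member type
summary = toy.groupby("member_type")["ride_minutes"].agg(["mean", "median"])
print(summary)
```

A large gap between the two columns for one group is a hint that outliers, rather than typical rides, drive that group's average.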

In [None]:
srt = df_cyclistic.sort_values(by= 'ride_length', ascending = False)
new_f = pd.DataFrame(srt[['member_type', 'ride_length']])
new_f.head(50)


**The longest rides also belong mostly to casual riders, with some ride durations spanning days: the 50 longest rides listed above are predominantly by casual riders. So, there is great potential to convert them into regular commuters.**

**Now, let's dig deeper into how both user types use the bikes over a week.**

In [None]:
#let's get some more insight into the day of the week on which each ride started
#and store it as the day_of_week column in our dataframe
#first create a dictionary for converting day-of-week numbers into their respective names
day_mapping = {
    
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}

df_cyclistic['started_at'] = pd.to_datetime(df_cyclistic['started_at'])
df_cyclistic['day_of_week'] = df_cyclistic['started_at'].dt.dayofweek.map(day_mapping)
df_cyclistic.head()


In [None]:
#lets calculate average ride length with respect to member type and day of the week
df_m = pd.DataFrame(df_cyclistic.groupby(['day_of_week', 'member_type'])['ride_length'].mean())
df_m.sort_values(by = 'ride_length', ascending = False)

**The table above shows that casual riders record their longest average ride duration on Thursdays, while riders with memberships ride longest, on average, on Sundays.**

In [None]:
# Group by 'day_of_week' and 'member_type', then count the occurrences of 'ride_id'
df_counts = df_cyclistic.groupby(['day_of_week', 'member_type'])['ride_id'].count().sort_values(ascending = False)
# Create separate DataFrames for 'Casual' and 'Member'
df_casual = df_counts[df_counts.index.get_level_values('member_type') == 'casual'].reset_index()
df_member = df_counts[df_counts.index.get_level_values('member_type') == 'member'].reset_index()

# plot the resulting DataFrames
fig = plt.figure(figsize=(12, 6))

ax0 = fig.add_subplot(1, 2, 1)
ax1 = fig.add_subplot(1, 2, 2)

# Annual member usage trend over the week
df_member.plot(kind='bar', x='day_of_week', y='ride_id', ax=ax0, legend=False)
ax0.set_title('Annual Members')
ax0.set_xlabel('Day of Week')
ax0.set_ylabel('No. of Rides')

# Casual riders bike using trend over a week
df_casual.plot(kind='bar', x='day_of_week', y='ride_id', ax=ax1, legend=False)
ax1.set_title('Casual Riders')
ax1.set_xlabel('Day of Week')
ax1.set_ylabel('No. of Rides')

plt.show()


**It can be inferred from the plots above that casual riders ride most on Sundays and least at the start of the week. In contrast, annual members ride mostly on weekdays and less on weekends.**
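
Weekday patterns are usually easier to read when the bars follow calendar order rather than count order. One way to get that, sketched here on invented counts (the `toy` frame is hypothetical), is to turn `day_of_week` into an ordered categorical before sorting or plotting.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical ride counts per day, deliberately out of calendar order
toy = pd.DataFrame({
    "day_of_week": ["Sunday", "Tuesday", "Monday", "Friday"],
    "ride_id": [40, 25, 10, 30],
})

# An ordered categorical makes sort_values follow the calendar
toy["day_of_week"] = pd.Categorical(toy["day_of_week"],
                                    categories=day_order, ordered=True)
toy_sorted = toy.sort_values("day_of_week").reset_index(drop=True)
print(toy_sorted)
```

Applied to frames such as `df_member` and `df_casual`, the same conversion followed by `sort_values('day_of_week')` should give Monday-to-Sunday bars.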

In [None]:
#let's get insights about the distance travelled by the users.
#For this, we need to define a 'haversine' function and calculate distance using latitude and longitude.

from math import radians, sin, cos, sqrt, atan2

def haversine(start_lat, start_lng, end_lat, end_lng):
    # convert latitudes and longitudes from degrees to radians
    start_lat, start_lng, end_lat, end_lng = map(radians, [start_lat, start_lng, end_lat, end_lng])
    dlon = end_lng - start_lng
    dlat = end_lat - start_lat
    # haversine formula
    a = sin(dlat/2)**2 + cos(start_lat) * cos(end_lat) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    # mean radius of the Earth in kilometres
    radius = 6371.0
    # distance in kilometres
    distance = radius * c
    return distance






In [None]:
#calling the haversine function and saving the result as a new column in our dataframe
df_cyclistic['distance_traveled'] = df_cyclistic.apply(lambda row: haversine(row['start_lat'], row['start_lng'], row['end_lat'], row['end_lng']), axis=1)
dist = pd.DataFrame(df_cyclistic.groupby('member_type')['distance_traveled'].mean())
dist.reset_index(inplace=True)

plt.figure(figsize=(10, 6))
plt.bar(dist['member_type'], dist['distance_traveled'], color='blue')
plt.xlabel('Member Type')
plt.ylabel('Mean Distance Traveled (km)')
plt.title('Mean Distance Traveled by Member Type')

plt.show()

**There is no significant difference between the average distance travelled by each category, but casual riders take the lead here, too, by approx. 1 km.**
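
Row-wise `apply` can be slow on a large frame. A vectorized variant of the haversine formula with NumPy is sketched below; `haversine_np` is a hypothetical helper name, and it uses `arcsin(sqrt(a))`, which is mathematically equivalent to the `atan2` form used above.

```python
import numpy as np

def haversine_np(start_lat, start_lng, end_lat, end_lng, radius_km=6371.0):
    """Great-circle distance in km; accepts scalars, NumPy arrays or pandas Series."""
    lat1, lng1, lat2, lng2 = map(np.radians, (start_lat, start_lng, end_lat, end_lng))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator, and a trip that ends where it starts
d = haversine_np(np.array([0.0, 41.88]), np.array([0.0, -87.63]),
                 np.array([0.0, 41.88]), np.array([1.0, -87.63]))
print(d)
```

Passing the four coordinate columns of `df_cyclistic` directly to this function, instead of calling `apply` with a lambda, would be one way to use it.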

In [None]:
#lets find the trend in trip route for both types of users
#for this we need to combine the start and end station points 
#then finding which route has been used the most using aggregate function

df_cyclistic['trip_route'] = df_cyclistic['start_station_name'].astype(str) + ' ' + 'to' + ' ' + df_cyclistic['end_station_name']
g = pd.DataFrame(df_cyclistic.groupby('member_type')['trip_route'].agg(lambda x: x.value_counts().idxmax()))

#now, calculate how often user has used this specific route

g_counts = df_cyclistic.groupby('member_type')['trip_route'].agg(lambda x: x.value_counts().max() if not x.empty else None)
g['frequency'] = g_counts

print(g)


**Here, we can see that the most frequent trip among casual riders starts at the 'HQ QR' station and returns to the same station. On the other hand, annual members made most of their trips from Canal St & Adams St to Michigan Ave & Washington St.**
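
The top route and its frequency can also be obtained in a single pass by counting every (member type, route) pair with `groupby(...).size()`. The sketch below runs on placeholder routes (the `toy` frame and its station names are made up for illustration).

```python
import pandas as pd

# Hypothetical trips; station names are placeholders
toy = pd.DataFrame({
    "member_type": ["casual", "casual", "casual", "member", "member", "member"],
    "trip_route": ["A to A", "A to A", "A to B", "C to D", "C to D", "D to C"],
})

# Count every (member_type, route) pair once
route_counts = (
    toy.groupby(["member_type", "trip_route"])
       .size()
       .reset_index(name="frequency")
)

# Keep the most frequent route for each member type
top_routes = (
    route_counts.sort_values("frequency", ascending=False)
                .drop_duplicates("member_type")
                .set_index("member_type")
                .sort_index()
)
print(top_routes)
```

Keeping `route_counts` around also makes it easy to inspect the runner-up routes, not only the single most frequent one.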

In [None]:
#let's geographically present the most frequent trip route for both user types
#to visualize geographic data we need to import folium
#then we create a folium map object, in our case rid_freq
#Map() centred on the coordinates below will display the map of Chicago, US
#finally, stations are represented on the Chicago map using their latitudes and longitudes
import folium
rid_freq = folium.Map(location = [41.8781, -87.6298], zoom_start = 13)
st_loc = [[41.8899, -87.6803], [41.8793, -87.6399], [41.8840, -87.6247]]
pop_msg = ['HQ QR (casual riders hotspot)', 'Canal St & Adams St (annual members hotspot)', 'Michigan Ave & Washington St (annual members hotspot)']
for location, popup_message in zip(st_loc, pop_msg):
    folium.Marker(location=location, popup=popup_message).add_to(rid_freq)
    
# Add a PolyLine connecting Canal St & Adams St and Michigan Ave & Washington St
polyline_coordinates = [[41.8793, -87.6399], [41.8840, -87.6247]]
folium.PolyLine(polyline_coordinates, color="blue").add_to(rid_freq)

# Add an annotation to the PolyLine
annotation_location = [(41.8793 + 41.8840) / 2, (-87.6399 + -87.6247) / 2]  # Midpoint of the PolyLine
annotation_text = "Most frequent ride of annual members"
folium.Marker(location=annotation_location, popup=annotation_text, icon=folium.Icon(color='red')).add_to(rid_freq)

from IPython.display import display
display(rid_freq)

In [None]:
#lets analyze the proportion of both casual and members
plt.figure(figsize=(12,6))
prop = df_cyclistic['member_type'].value_counts()
plt.pie(prop, labels = prop.index, autopct='%1.1f%%', startangle=90)
plt.title('Proportions of casual riders and membership holders')
plt.show()

**So, we need to craft marketing strategies to convince the casual riders, who make up about 11% of users, to subscribe to an annual membership.**
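
The percentage behind a pie chart can be read off as numbers with `value_counts(normalize=True)`. The labels below are invented for illustration and do not reproduce the real counts.

```python
import pandas as pd

# Hypothetical membership labels (not the real counts)
member_type = pd.Series(["member"] * 9 + ["casual"] * 1)

# Share of each member type as a percentage, rounded to one decimal
share = member_type.value_counts(normalize=True).mul(100).round(1)
print(share)
```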

In [None]:
df_cyclistic['member_type'].value_counts()

### Summary
**Question 1: How do casual riders use bikes differently from members?**
1. The **mean (ride_length)** for the two user types shows that the avg. ride duration of a member is **12 minutes**, whereas the avg. ride time of a casual user is **1 hr 35 min**.
2. The riders **who use Cyclistic bikes the longest** are mostly casual riders, with ride durations spanning **over days**. The **50 longest rides** are predominantly by casual riders.
3. The **longest average ride duration** for casual riders is recorded on **Thursdays**, while riders with memberships ride longest, on average, on **Sundays**.
4. Casual riders ride most on Sundays and least at the start of the week. In contrast, annual members ride mostly on weekdays and less on weekends.
5. There is no significant difference between the **average distance** travelled by each category, but casual riders take the lead here, too, by approx. 1 km.
6. The **HQ QR station** was prominent among casual riders; about 3% (3764) of casual riders commute daily to and from this particular station. Annual members made most of their trips from **Canal St & Adams St to Michigan Ave & Washington St**.

**Question 2: Why would a casual rider convert to a member?**
1. **Casual riders** are just 11% of the total Cyclistic users, yet their rides are much longer than those of annual members; their **avg. ride duration is 8 times** the avg. ride duration of membership holders. Given how much they use the bikes, they are likely candidates for a membership.
2. The **ride length** of a casual rider (at least for the 50 longest rides) spans days, with the largest being **108 days**. These riders are probably using **full-day passes**. One strategy to convert them into regular members is to **put a cap on ride duration** per day for casual users. To use a bike beyond this limit, they would have to buy a membership.
3. Most casual riders ride on Sundays. The company can provide **incentives or discounts** to regular members specifically on Sundays to attract casual riders into buying a membership, since they are the ones who use the bikes most on Sundays.
4. Lastly, approx. 3% of the casual riders use the HQ QR station for their daily commute. Cyclistic can target these specific users by **offering them special offers**.
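
Before proposing a cap on ride duration, it may be useful to estimate how many casual rides such a cap would actually affect. The sketch below assumes a hypothetical 3-hour cap (`CAP`) and invented ride lengths; neither value comes from the dataset or from Cyclistic.

```python
import pandas as pd

# Hypothetical cap; the analysis above does not specify a value
CAP = pd.Timedelta(hours=3)

# Hypothetical ride lengths in minutes
toy = pd.DataFrame({
    "member_type": ["casual", "casual", "casual", "member"],
    "ride_length": pd.to_timedelta([30, 200, 5000, 12], unit="min"),
})

# Share of casual rides that would exceed the cap
casual = toy[toy["member_type"] == "casual"]
share_over_cap = (casual["ride_length"] > CAP).mean()
print(f"{share_over_cap:.0%} of casual rides exceed the cap")
```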

**Question 3: How can a digital marketing strategy raise annual subscriptions?**

To convert existing casual riders, Cyclistic needs to launch a remarketing campaign tailored specifically to casual riders.

* First of all, a new **awareness campaign** should be launched via **e-mails**, **personal messages**, and **social media**, informing casual riders of the benefits of being an annual member.
* Create **attractive ads** to post on social media listing the **offers and incentives** riders receive if they convert into members.
* Second, inform casual riders about the limit the company is going to apply to ride duration on full-day passes, and invite them to subscribe to a membership in order to **ride unlimited**.
* Run a **promotions and incentives** campaign specifically for riders who ride on Sundays, with **discounts for annual members on Sundays**, to convince casual riders to take up a membership.
* Lastly, prepare a **detailed list of casual commuters who use the HQ QR station** the most. Approach them with deals to convert them into members, as they are already regular commuters.

Together, these steps form a remarketing campaign that makes casual users aware of the perks they can enjoy if they are promoted to annual members. Moreover, the marketing strategy needs to shift from a general awareness campaign to a more specific one.