# Cycle Brand Persona Determination using Python - Case Study

We are tasked with determining the brand persona for a new cycle share scheme.
The goal is to understand that persona well enough to create the most effective marketing strategies.

## Question Prompt

The cycle sharing scheme provides means for the people of the city to commute
using a convenient, cheap, and green transportation alternative. The service has 500
bikes at 50 stations across Seattle. Each of the stations has a dock locking system (where
all bikes are parked); kiosks (so customers can get a membership key or pay for a trip);
and a helmet rental service. A person can choose between purchasing a membership
key or a short-term pass. A membership key entitles the holder to an annual membership, and the key
can be obtained from a kiosk. Advantages for members include quick retrieval of bikes
and unlimited 45-minute rentals. Short-term passes offer access to bikes for a 24-hour
or 3-day time interval. Riders can pick up and return the bikes at any of the 50 stations
citywide.

## Issue that needs to be addressed 

Despite the company's expansion, customer retention has always been an issue. To grow the customer base, we have to decide on a marketing channel that offers broad reach at low cost. We are provided with a dataset of transaction history. In order to create an effective marketing strategy, we need to understand the persona of our customers. We need to answer questions like "Which attribute correlates best with trip duration and number of trips?
Which age generation adapts the most to our service?" etc.

## Preliminary Analysis

In [None]:
# importing the packages needed

import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import numpy as np
import scipy
from scipy import stats
import seaborn

In [None]:
# reading the csv file
data = pd.read_csv("../input/cycle-trips-dataset/cycle_trips_dataset.csv")

As a first step, determining the size of the dataset and seeing the top 5 observations helps us to get a glimpse of how our data looks

In [None]:
print(len(data))         # len gives the number of rows in the dataset
data.head(5)            # head is used to view the topmost observations of the dataset

We need to determine the type of each column, i.e., whether it is an integer, a string, etc. We will use the `dtypes` attribute of the DataFrame.

In [None]:
data.dtypes

##  Univariate analysis

This is analysis performed on a single variable, so it does not account for any relationship among explanatory variables. We perform univariate analysis on the dataset to better understand the features in isolation.

We want to print the date range that starts at the first value of starttime and ends at the last value of stoptime.

In [None]:
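# starttime is stored as a string, so sort on the parsed timestamp rather than on the raw text
data = data.sort_values(by='starttime', key=lambda s: pd.to_datetime(s, format="%m/%d/%Y %H:%M"))
data = data.reset_index(drop=True)   # reset_index returns a new frame, so assign it back
print('Date range of dataset: %s - %s' % (data.loc[0, 'starttime'], data.loc[len(data)-1, 'stoptime']))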

We can draw two important insights from above:

1. The data ranges from October 2014 to September 2016, i.e., roughly two years of data spanning three calendar years.
2. The cycle sharing service appears to operate beyond the standard 9-to-5 business hours.
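Since `starttime` and `stoptime` are read in as strings, another way to obtain the date range is to parse both columns and take the minimum and maximum, which does not depend on row order. The following is a minimal sketch on a small made-up frame: the three rows are hypothetical, the format string is the one used later in this notebook, and it is assumed that `stoptime` follows the same format.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook, `data` would be used instead.
sample = pd.DataFrame({
    "starttime": ["10/13/2014 10:31", "1/5/2015 8:02", "9/1/2016 0:15"],
    "stoptime": ["10/13/2014 10:48", "1/5/2015 8:20", "9/1/2016 0:40"],
})

fmt = "%m/%d/%Y %H:%M"
start = pd.to_datetime(sample["starttime"], format=fmt)
stop = pd.to_datetime(sample["stoptime"], format=fmt)

# min/max work on the parsed timestamps, so no sorting is needed
print("Date range of dataset: %s - %s" % (start.min(), stop.max()))

# Sorting the raw strings is lexicographic, not chronological:
# the first string is not the earliest trip.
first_by_text = sorted(sample["starttime"])[0]
print("First value when sorted as text:", first_by_text)
```

The last two lines show why sorting the unparsed column can give a misleading first row.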

### Type of Memberships

The next step is to determine whether more people prefer buying a short-term pass or an annual membership.

To determine that we need to plot a bar graph of trip frequencies by user type

In [None]:
groupby_user = data.groupby('usertype').size()
groupby_user.plot.bar(title = 'Distribution of User by membership')

Conclusion: Members account for more trips than short-term pass holders. A plausible explanation is that new users start with a short-term pass and, once they have tried the service and
are satisfied, move to a membership to receive the perks and benefits
offered.

### Distribution By gender

The next task is to determine whether males or females use the cycle service more, or whether both use it equally.

In [None]:
groupby_gender = data.groupby('gender').size()
groupby_gender.plot.bar(title = 'Distribution of Trips by Gender')

Conclusion: The number of trips taken by males is more than three times the number taken by females.
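A bar chart makes the gap visible, but the ratio itself is easier to read from the counts. The sketch below shows one way to compute it with `value_counts`; the labels are hypothetical stand-ins for `data['gender']`, so the printed numbers are not the notebook's results.

```python
import pandas as pd

# Hypothetical gender labels standing in for data['gender'];
# None mimics trips with no recorded gender.
gender = pd.Series(["Male"] * 7 + ["Female"] * 2 + [None] * 3)

counts = gender.value_counts()                 # missing values are excluded by default
shares = gender.value_counts(normalize=True)   # proportions among trips with a known gender
ratio = counts["Male"] / counts["Female"]

print(counts)
print(shares.round(3))
print("Male-to-female trip ratio: %.2f" % ratio)
```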

### Target Age Group

We need to know more about the target customers at whom the company's marketing message will be aimed. For that, we can create an age-wise distribution of our data to understand the demographics better.

In [None]:
data = data.sort_values(by = 'birthyear')
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title = 'Distribution by Birth Year', figsize = (15,4))


Conclusion: Most users were born between 1980 and 1990, i.e., they belong to Gen Y (also known as millennials). Recent reports published by Elite Daily and CrowdTwist state that millennials are the generation most loyal to their favorite brands.

Hence we would expect most millennials to be members rather than short-term pass holders. The following code checks this notion.

In [None]:
data_mil = data[(data['birthyear'] >= 1977) & (data['birthyear'] <= 1994)]
groupby_mil = data_mil.groupby('usertype').size()
groupby_mil.plot.bar(title = 'Distribution of millennial user type')

Conclusion: At first glance the notion stated by the reports appears to be true, as all the millennials in the data are members rather than short-term pass holders.

## Multivariate Analysis 

Multivariate analysis refers to the incorporation of multiple explanatory variables to
understand the behavior of a response variable. This seems to be the most feasible
and realistic approach, considering that entities in the real world are usually
interconnected. Thus the variability in the response variable might be affected by the
variability in the interconnected explanatory variables.

We want to validate whether the field birthyear is available only for members and not for short-term pass holders.

In [None]:
data[data['usertype'] == 'Short-Term Pass Holder']['birthyear'].isnull().values.all()

The birth year is recorded only for members; short-term pass holders are not required to provide it. This means the earlier millennial plot could only ever contain members, so it cannot support our prior conclusion that "millennials are very loyal to the brands they like".
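The reason the millennial filter returned only members can be reproduced on a toy frame: comparisons against a missing value evaluate to False, so any row without a birth year is silently dropped by a range filter. The rows below are hypothetical and only illustrate the mechanism.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: short-term pass holders have no birth year recorded
trips = pd.DataFrame({
    "usertype": ["Member", "Member", "Short-Term Pass Holder", "Short-Term Pass Holder"],
    "birthyear": [1985.0, 1960.0, np.nan, np.nan],
})

# NaN >= 1977 and NaN <= 1994 are both False, so rows with a missing
# birth year never pass the filter, whatever the rider's real age is.
mask = (trips["birthyear"] >= 1977) & (trips["birthyear"] <= 1994)
millennials = trips[mask]

print(millennials["usertype"].value_counts())
print("Rows with missing birth year:", trips["birthyear"].isnull().sum())
```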

Now we validate the same thing for gender

In [None]:
data[data['usertype'] == 'Short-Term Pass Holder']['gender'].isnull().values.all()

From the output above we can conclude that we do not have any demographic details for short term pass holders. 
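The two checks above can also be combined into a single table giving the share of missing values per user type and column. This is a sketch on hypothetical rows, assuming the same column names as the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column names as the dataset
trips = pd.DataFrame({
    "usertype": ["Member", "Member", "Member",
                 "Short-Term Pass Holder", "Short-Term Pass Holder"],
    "gender": ["Male", "Female", "Male", None, None],
    "birthyear": [1985.0, 1990.0, 1971.0, np.nan, np.nan],
})

# isnull() gives True/False per cell; the mean of booleans within each
# user type is the fraction of missing values for that column.
missing_share = trips[["gender", "birthyear"]].isnull().groupby(trips["usertype"]).mean()
print(missing_share)
```

A value of 1.0 in a row means the column is entirely missing for that user type.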

#### Time series Analysis

We are interested in seeing how the frequency of trips varies across date and time. To make a time series plot, we first need to convert the date from string format to date-time format, and then split the date-time into its components (year, month, day, hour, etc.).

In [None]:
# parse the start time once, then derive the date components with the .dt accessor
data['starttime_mod'] = pd.to_datetime(data['starttime'], format="%m/%d/%Y %H:%M")
data['starttime_date'] = data['starttime_mod'].dt.date
data['starttime_year'] = data['starttime_mod'].dt.year
data['starttime_month'] = data['starttime_mod'].dt.month
data['starttime_day'] = data['starttime_mod'].dt.day
data['starttime_hour'] = data['starttime_mod'].dt.hour

In [None]:
data.groupby('starttime_date')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by date', figsize = (15,4))

There appears to be a pattern in the above time series.

The pattern repeats over a fixed interval of time, that is, it shows seasonality. In fact, we can split the distribution into three components. The first is the seasonal pattern that repeats over time. The second is a flat density distribution (the baseline level). The third is the set of lines (that is, the spikes) that rise above that baseline. For time series prediction, we can estimate each of these components for a future time and add the estimates together to produce a forecast with a calculated confidence interval.
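The three components described above can be separated roughly with rolling statistics. The sketch below is only an illustration on a synthetic daily series (a made-up level, a yearly sine wave, and random spikes), not the notebook's data and not a full forecasting method; the window length of 31 days and the spike threshold of 400 are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2014-10-01", "2016-09-30", freq="D")
day_of_year = dates.dayofyear.to_numpy()

# Synthetic daily mean trip duration: flat level + yearly seasonality + occasional spikes
level = 1000.0
seasonal = 300.0 * np.sin(2 * np.pi * (day_of_year - 105) / 365.25)
spikes = np.where(rng.random(len(dates)) < 0.03, 800.0, 0.0)
series = pd.Series(level + seasonal + spikes, index=dates)

# 1. Flat baseline: the overall median, which is robust to the spikes
est_level = series.median()
# 2. Seasonality: a centered rolling median around that baseline
est_seasonal = series.rolling(31, center=True, min_periods=1).median() - est_level
# 3. Spikes: whatever is left once baseline and seasonality are removed
est_spikes = series - est_level - est_seasonal

detected = int((est_spikes > 400).sum())
print("Estimated baseline level: %.1f" % est_level)
print("Days flagged as spikes: %d of %d" % (detected, len(series)))
```

By construction the three estimated components add back up to the original series, which is what allows them to be forecast separately and then summed.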

#### Distribution of trips by year

In [None]:
data.groupby('starttime_year')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by year', figsize = (15,4))

In the above, we can see a trend that the mean duration of trips is increasing on a yearly basis

#### Distribution of Trip by Months

In [None]:
data.groupby('starttime_month')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by month', figsize = (15,4))

We can conclude that the mean trip duration rises through the first half of the year, peaks in July, and then declines.

#### Distribution of trips by day

In [None]:
data.groupby('starttime_day')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by day', figsize = (15,4))

We can conclude that there isn't any evident pattern in the trip duration by day; it is quite fluctuating

### Measuring Central Tendency

Measures like mean, median, and mode help give a summary view of the features in question.

#### Trip duration Analysis

We need to know the mean and median trip duration.
We are also interested in finding the station from which most trips originate, so that promotional schemes for existing customers can be run there. For this, we compute the mode with `Counter` from the `collections` module.

In [None]:
from collections import Counter
trip_duration = list(data['tripduration'])
Station_from = list(data['from_station_name'])
data_mode = Counter(Station_from)
print('The mean duration of trip is : %f' %statistics.mean(trip_duration))
print('The median duration of trip is : %f' %statistics.median(trip_duration))
print('The Station from which most trips originate is : %s' %data_mode.most_common(1))
mean_trip_duration = statistics.mean(trip_duration)

Conclusion: The station from which the most trips originate is Pier 69 / Alaskan Way & Clay St. Hence this is the ideal location for running promotional campaigns targeted at existing customers.
We can also see that the mean is considerably higher than the median, which suggests that some outliers are present. We need to plot the distribution of tripduration to explore this point.

In [None]:
data['tripduration'].plot.hist(bins=100, title='Frequency distribution of Trip duration')
plt.show()

Conclusion: The extreme values on the right side are not very frequent but their extreme nature tends to increase the value of the mean
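A small worked example shows how strongly a single extreme value pulls the mean while leaving the median almost unchanged. The durations below are hypothetical values in seconds, chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical trip durations in seconds; the last value is an extreme one
durations = [300, 420, 480, 540, 600, 660, 9000]

mean_all = statistics.mean(durations)
median_all = statistics.median(durations)

# The same list without the extreme value
trimmed = durations[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print("With the extreme value:    mean=%.1f  median=%.1f" % (mean_all, median_all))
print("Without the extreme value: mean=%.1f  median=%.1f" % (mean_trimmed, median_trimmed))
```

Dropping one value out of seven moves the mean by more than a factor of three, while the median shifts only from 540 to 510.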

#### Box plot to determine outliers

In [None]:
box = data.boxplot(column=['tripduration'])
plt.show()

Conclusion: There are a large number of outliers in the tripduration feature. We now need to determine the proportion of outliers to understand whether they form a majority or a minority.

In [None]:
q75, q25 = np.percentile(trip_duration, [75 ,25])
iqr = q75 - q25
Percent_outlier = ((len(data) - len([x for x in trip_duration if q75+(1.5*iqr) >= x >= q25-(1.5*iqr)]))*100/float(len(data)))
print ('Proportion of values as outlier: %f percent' %Percent_outlier)
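The cell above applies the usual box-plot rule: a value counts as an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The same rule can be wrapped in a small helper so it can be reused on other columns or subsets. The function name `iqr_outlier_share` and the sample values are hypothetical.

```python
import numpy as np

def iqr_outlier_share(values, k=1.5):
    """Return (lower whisker, upper whisker, percent of values outside them)."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    lower, upper = q25 - k * iqr, q75 + k * iqr
    outside = (values < lower) | (values > upper)
    return lower, upper, outside.mean() * 100

# Hypothetical trip durations in seconds
sample = [300, 420, 480, 540, 600, 660, 9000, 12000]
lower, upper, share = iqr_outlier_share(sample)
print("Whiskers: %.1f to %.1f" % (lower, upper))
print("Proportion of values as outlier: %.1f percent" % share)
```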

Conclusion: Because this is time series data, we cannot simply remove the outliers. The best option is to apply some kind of transformation.
To do that, we first need the mean of all the non-outliers.

In [None]:
List_non_outlier = list(x for x in trip_duration if q75+(1.5*iqr) >= x >= q25-(1.5*iqr))
mean_non_outlier = statistics.mean(List_non_outlier)
print('The mean of the non outliers is %f '%mean_non_outlier)

Conclusion: The mean of the non-outlier trip duration values (approximately 712) is considerably lower than the mean calculated in the presence of outliers (approximately 1,203). This illustrates how strongly the mean is affected by the
presence of outliers in the dataset.

#### Function to transform outliers

In [None]:
upper_whisker = q75 + (1.5*iqr)

def transform_tripduration(x):
    # replace values above the upper whisker with the mean of the non-outliers computed above
    if x > upper_whisker:
        return mean_non_outlier
    return x

data['tripduration_mean'] = data['tripduration'].apply(transform_tripduration)
data['tripduration_mean'].plot.hist(bins=100, title='Frequency Distribution of mean transformed trip duration')
plt.show()

print ('Mean of trip duration: %f'%data['tripduration_mean'].mean())
print ('Standard deviation of trip duration: %f'%data['tripduration_mean'].std())
print ('Median of trip duration: %f'%data['tripduration_mean'].median())
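This kind of transformation can also be written without a row-by-row `apply`. The sketch below is one possible vectorised variant that replaces values above the upper whisker with the mean of the non-outlier values; the helper name `replace_upper_outliers` and the sample durations are hypothetical.

```python
import numpy as np
import pandas as pd

def replace_upper_outliers(series, k=1.5):
    """Replace values above Q3 + k*IQR with the mean of the values inside the whiskers."""
    q25, q75 = np.percentile(series, [25, 75])
    iqr = q75 - q25
    lower, upper = q25 - k * iqr, q75 + k * iqr
    inlier_mean = series[(series >= lower) & (series <= upper)].mean()
    # where() keeps values that satisfy the condition and substitutes the rest
    return series.where(series <= upper, inlier_mean)

# Hypothetical trip durations in seconds
durations = pd.Series([300, 420, 480, 540, 600, 660, 9000, 12000], dtype=float)
transformed = replace_upper_outliers(durations)

print(transformed.tolist())
print("Mean before: %.1f, after: %.1f" % (durations.mean(), transformed.mean()))
```

The same helper could be applied to a subset, such as the trips of a single gender, by passing in the filtered column.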

#### Finding the tripduration measures of central tendency for MALES

In [None]:
data_males = data[data['gender'] == 'Male'].copy()   # copy so a column can be added later without a SettingWithCopyWarning
trip_duration_male = list(data_males['tripduration'])
Station_from_male = list(data_males['from_station_name'])
data_mode_males = Counter(Station_from_male)
print('The mean duration of trip is : %f' %statistics.mean(trip_duration_male))
print('The median duration of trip is : %f' %statistics.median(trip_duration_male))
print('The Station from which most trips originate is : %s' %data_mode_males.most_common(1))
mean_trip_duration_male = statistics.mean(trip_duration_male)

In [None]:
data_males['tripduration'].plot.hist(bins=100, title='Frequency distribution of Trip duration of males')
plt.show()

Conclusion: The distribution seems to be slightly positively skewed. So we need to check for outliers using the box plot method

In [None]:
box_males = data_males.boxplot(column=['tripduration'])
plt.show()

Conclusion: There are a lot of outliers in male trips as well. Let us transform those outliers.

In [None]:
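q75, q25 = np.percentile(trip_duration_male, [75 ,25])
iqr = q75 - q25
upper_whisker_males = q75 + (1.5*iqr)
lower_whisker_males = q25 - (1.5*iqr)
# mean of the non-outlier male trips, used as the replacement value
mean_non_outlier_male = statistics.mean([x for x in trip_duration_male if upper_whisker_males >= x >= lower_whisker_males])

def transform_males(x):
    if x > upper_whisker_males:
        return mean_non_outlier_male
    return x

data_males['tripduration_mean'] = data_males['tripduration'].apply(transform_males)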
data_males['tripduration_mean'].plot.hist(bins=100, title='Frequency Distribution of mean transformed trip duration of males')
plt.show()

print ('Mean of trip duration of males: %f'%data_males['tripduration_mean'].mean())
print ('Standard deviation of trip duration of males: %f'%data_males['tripduration_mean'].std())
print ('Median of trip duration of males: %f'%data_males['tripduration_mean'].median())

## Correlation Analysis

### Determining the strength of relationships between variables

Correlation refers to the strength and direction of the linear relationship between two
quantitative features. A correlation value of 1 means a perfect linear relationship in the positive
direction, whereas a correlation value of -1 means a perfect linear relationship in the negative
direction. A value of 0 means no linear correlation between the quantitative features.
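To make the definition concrete, the Pearson coefficient can be computed by hand as the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below does this on hypothetical ages and durations and compares the result with pandas.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical ages (years) and trip durations (seconds)
age = [22, 30, 35, 41, 58]
duration = [900, 650, 700, 500, 620]

r_manual = pearson_r(age, duration)
r_pandas = pd.Series(age).corr(pd.Series(duration))   # pandas uses Pearson by default

print("Manual: %.4f, pandas: %.4f" % (r_manual, r_pandas))
print("A feature against itself:", round(pearson_r(age, age), 4))
print("A feature against its negation:", round(pearson_r(age, [-a for a in age]), 4))
```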

An interesting question to ask is whether a change in age brings a change in trip duration. Let's find out using correlation.

In [None]:
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)   # the full option name avoids ambiguity in newer pandas versions
data['age'] = data['starttime_year'] - data['birthyear']
correlations = data[['tripduration','age']].corr(method='pearson')
print(correlations)

The correlation came out weak and positive, so there is no clear linear relationship between age and trip duration.
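Two caveats apply when reading a coefficient like this: rows without a birth year have to be excluded, and Pearson's coefficient is sensitive to the outliers seen earlier in trip duration. The sketch below, on randomly generated stand-in data, shows how `scipy.stats.pearsonr` reports a p-value alongside the coefficient and how the rank-based Spearman coefficient (a different technique, shown only as a cross-check) can be computed on the same rows. None of the numbers it prints relate to the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Randomly generated stand-in data, not the notebook's dataset
rng = np.random.default_rng(42)
n = 500
age = rng.integers(18, 70, size=n).astype(float)
tripduration = rng.exponential(scale=600, size=n) + 2.0 * age
age[:50] = np.nan   # mimic riders with no recorded birth year

# pearsonr and spearmanr need complete pairs, so drop rows with missing age
frame = pd.DataFrame({"age": age, "tripduration": tripduration}).dropna()

r, p_value = stats.pearsonr(frame["age"], frame["tripduration"])
rho, p_rank = stats.spearmanr(frame["age"], frame["tripduration"])

print("Rows used: %d" % len(frame))
print("Pearson r = %.3f (p = %.3g)" % (r, p_value))
print("Spearman rho = %.3f (p = %.3g)" % (rho, p_rank))
```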

# Conclusions

Trip duration follows a definite seasonal pattern that
repeats over time. Forecasting this time series can help us predict the times when
the company needs to push its marketing efforts, and anticipating when most trips will
occur can help ensure operational efficiency.

As for the promotions, we now know that
the best station at which to kick off the campaign would be Pier 69/Alaskan Way & Clay St.