In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# import numpy as np # linear algebra
# import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![Bike Sharing](https://newsinfo.inquirer.net/files/2016/08/bike-share.jpg)
## Visual analysis of shared bicycle data

1. Ask a question (Business Understanding)
2. Data Understanding
    - Data collection
    - Import Data
    - View data set information
3. Data Cleaning (Data Preparation)
    - Data preprocessing
    - Feature Engineering
4. Modeling
5. Model evaluation (Evaluation)
6. Plan implementation (Deployment)
    - Submit results to Kaggle
    - Report writing

## Introduction:
### Concrete steps

The analysis follows the steps listed above, although the order of some steps may be adjusted.

### 1. Data collection

Data download address:
[kaggle](https://www.kaggle.com/c/bike-sharing-demand)


### 2. Project background
A bike-sharing system is a form of bicycle rental in which rentals and returns are recorded automatically through a network of self-service kiosks throughout a city. Using such a system, people can rent a bicycle in one place and return it in a different place as needed.

The data generated by the system records the duration of each trip, the departure location, the arrival location, and the elapsed time. The bike-sharing system therefore acts as a sensor network that can be used to study mobility in a city. This project analyzes how factors such as weather and time relate to the number of rentals in historical usage data, in order to predict rental demand in the bike-sharing program of the US capital, Washington, D.C.

![The Bikeshare Planning](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQYqLSyhZlabEFkwzudA4Ncm9-Jz8vZ_5yVcA&usqp=CAU)

### 2.1 Asking questions

Q. What are the factors that affect the use of shared bicycles?

Ans: **The Factors Affecting Bike-Sharing Demand**

A comprehensive review of the factors affecting bike-sharing demand is needed to bridge gaps in knowledge about weather, built environment and land use, public transportation, station level, socio-demographic effects, temporal factors, and safety. Such a review evaluates recent studies on station-based bike sharing in the literature and seeks answers to two main research questions:

First, how do weather conditions, built environment and land use, public transportation, socio-demographic attributes, temporal factors, and safety affect bike-sharing trip demand?

Second, which factors affecting trip demand are most commonly used in the literature?

For this purpose, an overview of the factors affecting trip demand has been established to evaluate the performance of Bike-Share Programs (BSPs) comprehensively. The results can help planners and decision-makers understand the key factors contributing to bike-sharing demand. The overview can also serve as a guideline for BSP planners, policymakers, and researchers who want to improve the efficiency of BSPs.

### 3. Understand the data

In [None]:
# Import the data analysis packages before starting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline

# Ignore warning
import warnings
warnings.filterwarnings('ignore')

# Import training data set
train = pd.read_csv('../input/bike-sharing-demand/train.csv')
# Import test data set
test = pd.read_csv('../input/bike-sharing-demand/test.csv')
print('Training data set:',train.shape,'Test data set:',test.shape)

In [None]:
# Merge the data sets (DataFrame.append was removed in pandas 2.0, so use pd.concat)
full = pd.concat( [train , test] , ignore_index = True )

print('The combined data set:',full.shape)

In [None]:
# View data set
full.head()

### Feature description:

**Let's explore the data.**

- datetime - hourly date + timestamp
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather 
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals

In [None]:
# View the data type and total number of data in each column to see if the data has missing values
full.info()

**This shows that, compared with the training data set, the test data set lacks three columns: casual, registered, and count. These are the values a model would have to predict. Here we only consider data visualization and leave modeling for later.**

In [None]:
# Get summary statistics for the numeric columns
full.describe()

### 4. Data visualization analysis

1. Numerical type: (use directly)
    - temp: actual temperature
    - atemp: "feels like" temperature
    - humidity: relative humidity
    - windspeed: wind speed
    - casual: number of rentals by non-registered users
    - registered: number of rentals by registered users
    - count: total number of rentals
2. Time series:
    - datetime: split into year, month, day, hour, and day of the week
3. Categorical data: (replace categories with numerical values and perform one-hot encoding)
    - season: season. 1: Spring; 2: Summer; 3: Autumn; 4: Winter
    - holiday: whether it is a holiday. 0: No; 1: Yes
    - workingday: whether it is a working day. 0: No; 1: Yes
    - weather: weather. 1: clear; 2: cloudy; 3: light rain or snow; 4: severe weather
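The one-hot encoding mentioned for the categorical columns can be sketched as follows. This is only an illustration: the `one_hot` helper and the tiny `sample` frame are made up for the example, and in the notebook the same call would be applied to `full`.

```python
import pandas as pd

# Tiny made-up sample; in the notebook this would be the `full` DataFrame.
sample = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weather": [1, 1, 2, 3],
    "temp": [9.84, 22.14, 30.34, 13.94],
})

def one_hot(df, columns):
    """Return a copy of df with the given categorical columns one-hot encoded."""
    # Each category value becomes its own indicator column, e.g. season_1 ... season_4.
    return pd.get_dummies(df, columns=columns, prefix=columns)

encoded = one_hot(sample, ["season", "weather"])
print(encoded.columns.tolist())
```

The original `season` and `weather` columns are replaced by indicator columns, while numeric columns such as `temp` are left untouched.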

In [None]:
# Split year, month, day, hour
full['date'] = full.datetime.apply( lambda a : a.split( )[0]) 
full['year'] = full.datetime.apply( lambda a : a.split( )[0].split('-')[0]).astype('int')
full['month'] = full.datetime.apply( lambda a : a.split( )[0].split('-')[1]).astype('int')
full['day'] = full.datetime.apply( lambda a : a.split( )[0].split('-')[2]).astype('int')
# Despite its name, 'weekend' holds the ISO day of the week (1 = Monday ... 7 = Sunday)
full['weekend'] = full.date.apply( lambda a : datetime.strptime( a , '%Y-%m-%d').isoweekday())
full['hour'] = full.datetime.apply( lambda a : a.split( )[1].split(':')[0]).astype('int')
# Delete datetime
full.drop('datetime' , axis = 1 , inplace = True)
full.head()
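As an alternative to splitting the timestamp strings by hand, the same features can be derived with `pd.to_datetime` and the `.dt` accessor. The sketch below assumes the timestamp format `YYYY-MM-DD HH:MM:SS`; the `add_time_features` helper and the two sample rows are hypothetical and only mirror the column names used above.

```python
import pandas as pd

def add_time_features(df, column="datetime"):
    """Return a copy of df with date parts derived from a timestamp column."""
    out = df.copy()
    ts = pd.to_datetime(out[column])
    out["date"] = ts.dt.strftime("%Y-%m-%d")
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    # dayofweek is 0 = Monday, so add 1 to match isoweekday() (1 = Monday ... 7 = Sunday)
    out["weekend"] = ts.dt.dayofweek + 1
    out["hour"] = ts.dt.hour
    return out.drop(columns=[column])

# Two made-up rows in the same timestamp format as the data set
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2012-12-19 23:00:00"]})
features = add_time_features(sample)
print(features)
```

Parsing once and reading the parts from `.dt` avoids repeated string splitting and yields integer columns directly.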

In [None]:
# Calculate the correlation coefficients between the numeric columns
# (numeric_only=True skips the string column 'date'; requires pandas >= 1.5)
corrDf = full.corr(numeric_only = True)

# View the correlation coefficient of each feature with count
corrDf['count'].sort_values(ascending = False)

**The correlation coefficients show that humidity, temp, and atemp are noticeably correlated with count. The coefficients of temp and atemp with count are very close, so we can analyze temp alone. year, month, season, windspeed, and weather are also correlated with count, while the correlation coefficients of workingday, weekend, and holiday with count are extremely small.**

In [None]:
# To show the relationships between all features more intuitively, draw a heat map of the correlation coefficients
import seaborn as sn

# Time features (year, month, day, weekend, hour) plus the original training columns
df = pd.concat([full.iloc[:, -5:].astype(int), train.iloc[:, 1:]], axis=1)
corrDf = df.corr()
# Hide the upper triangle (above the diagonal) so that each pair appears only once
mask = np.triu(np.ones_like(corrDf, dtype=bool), k=1)
fig = plt.figure(figsize=(16, 16))
sn.heatmap(corrDf, mask=mask, annot=True, square=True)

**Next, analyze in depth the influence of each feature on count, and visualize each feature**

In [None]:
# Note: seaborn >= 0.12 requires x and y to be passed as keyword arguments
# ① Time dimension--year
sn.boxplot(x = 'year' , y = 'count' , data = full)
plt.title('The influence of year')
plt.show()

# ② Time dimension--month
sn.pointplot(x = 'month' , y = 'count' , data = full)
plt.title('The influence of month')
plt.show()

# ③ Time dimension--season
sn.boxplot(x = 'season' , y = 'count' , data = full)
plt.title('The influence of season')
plt.show()

# ④ Time dimension--hour
sn.pointplot(x = 'hour' , y = 'count' , data = full)
plt.title('The influence of hour')
plt.show()

### Summary:

   - ① The number of rentals in 2012 was higher than in 2011, which suggests that shared bicycles became more familiar and more widely accepted over time and that the number of users gradually increased.
   - ② The month has a clear effect on the number of rentals: it increases month by month from January to June, stays near the maximum from June to October, and decreases month by month from October to December, showing strong seasonality.
   - ③ By number of rentals, autumn > summer > winter > spring. Spring having fewer rentals than winter differs from what we would normally expect. It may be that temperatures rise slowly in spring, but whether the cause is temperature or other factors such as humidity, wind speed, and weather requires further analysis of those features.
   - ④ The number of rentals peaks at around 8:00 and around 17:00, which clearly corresponds to commuting hours. Is the pattern the same on rest days? To find out, we still need to compare working days with rest days.
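One way to investigate the surprising spring result in ③ is to check which calendar months each season code actually covers. The sketch below is a hypothetical helper run on made-up rows, so its printed mapping says nothing about the real data; in the notebook it would be called on `full` after the `month` column has been created.

```python
import pandas as pd

def months_per_season(df):
    """Map each season code to the sorted list of months that appear with it."""
    return {
        int(season): sorted(int(m) for m in group["month"].unique())
        for season, group in df.groupby("season")
    }

# Made-up rows for illustration only; they do not reflect the real data set.
sample = pd.DataFrame({
    "season": [1, 1, 2, 3, 4, 4],
    "month": [1, 3, 4, 7, 10, 12],
})
print(months_per_season(sample))
```

If the months assigned to a season differ from the usual meteorological definition, that alone could explain part of the ordering seen in the box plot.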

**Note**

Box plots use the IQR method to decide which points are drawn as outliers.

- Wikipedia Definition

The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data. It is a measure of the dispersion similar to standard deviation or variance, but is much more robust against outliers.

Outliers identified this way can be removed.
 - Okay, let's check!
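The IQR rule can be sketched in a few lines. The helpers `iqr_bounds` and `drop_outliers` are hypothetical names, the multiplier `k = 1.5` is the conventional whisker length (and the default used by seaborn box plots), and the sample values are made up.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(df, column, k=1.5):
    """Keep only the rows whose value in `column` lies within the IQR limits."""
    low, high = iqr_bounds(df[column], k)
    return df[df[column].between(low, high)]

# Made-up counts with one obvious outlier
sample = pd.DataFrame({"count": [10, 12, 14, 16, 18, 20, 500]})
cleaned = drop_outliers(sample, "count")
print(iqr_bounds(sample["count"]), len(cleaned))
```

Whether removing such rows is appropriate depends on the analysis: very high rental counts may be genuine peak-hour demand rather than errors.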

In [None]:
# Weather factors
sn.boxplot(x = 'weather' , y = 'count' , data = train)
plt.title('The influence of weather')
plt.show()

In [None]:
# Temperature, humidity, wind speed factors
cols = ['temp' , 'atemp' , 'humidity' , 'windspeed' , 'count']
sn.pairplot(full[cols])
plt.show()

**A pair plot of several continuous variables makes it possible to compare the relationship between any two of them. The figure clearly shows that temp and atemp are roughly linearly related, but one group of points deviates noticeably from the linear trend, which may be related to humidity and wind speed. We can therefore consider atemp to be jointly determined by temp, humidity, and windspeed, so the atemp feature can be dropped in later modeling.**

**Further study the influence of temperature, humidity and wind speed on the number of leases:**

In [None]:
# One row of three panels; reuse the axes created by plt.subplots
fig , (ax1 , ax2 , ax3) = plt.subplots(1,3,figsize = (24,8))
# seaborn >= 0.12 requires x and y as keyword arguments
sn.regplot(x = 'temp' , y = 'count' , data = train , ax = ax1)
sn.regplot(x = 'humidity' , y = 'count' , data = train , ax = ax2)
sn.regplot(x = 'windspeed' , y = 'count' , data = train , ax = ax3)
ax1.set_title('The influence of temp')
ax2.set_title('The influence of humidity')
ax3.set_title('The influence of windspeed')

**Although the points for the three weather factors are widely scattered, the fitted lines show that temperature and wind speed are positively correlated with the number of rentals, while humidity is negatively correlated with it.**

**Next, we will analyze the impact of the day of the week, holidays, and working days**

In [None]:
# One wide panel on top and two panels below, all on the same figure
fig = plt.figure(figsize = (16, 10))
# 'weekend' holds the day of the week (1 = Monday ... 7 = Sunday)
ax1 = fig.add_subplot(2,1,1)
sn.pointplot(x = 'hour' , y = 'count' , hue = 'weekend' , data = full , ax = ax1)
ax1.set_title('The influence of hour (day of week)')

ax2 = fig.add_subplot(2,2,3)
sn.pointplot(x = 'hour' , y = 'count' , hue = 'workingday' , data = full , ax = ax2)
ax2.set_title('The influence of hour (workingday)')

ax3 = fig.add_subplot(2,2,4)
sn.pointplot(x = 'hour' , y = 'count' , hue = 'holiday' , data = full , ax = ax3)
ax3.set_title('The influence of hour (holiday)')

**On working days, rentals are high during the morning and evening rush hours and low during the rest of the day. On holidays, rentals are higher around noon and in the afternoon, which matches how people tend to travel on days off.**
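The pattern read from the plots can also be checked numerically by tabulating the mean number of rentals per hour for working and non-working days. The `hourly_profile` helper and the sample rows below are made up for illustration; in the notebook the function would be applied to the training rows of `full`.

```python
import pandas as pd

def hourly_profile(df, by="workingday"):
    """Mean rentals per hour, with one column per value of `by`."""
    return df.groupby(["hour", by])["count"].mean().unstack(by)

# Made-up rows for illustration only
sample = pd.DataFrame({
    "hour": [8, 8, 8, 13, 13, 13],
    "workingday": [1, 1, 0, 1, 0, 0],
    "count": [400, 300, 100, 150, 350, 250],
})
profile = hourly_profile(sample)
print(profile)
```

Each row of the result is an hour and each column a day type, so rush-hour peaks on working days and midday peaks on rest days show up as large values in different columns.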

![Thanks](https://static1.squarespace.com/static/548b4aa0e4b08dc696411c49/t/549655c3e4b0ce400b9a201b/1419138500229/?format=1500w)