# **Analysis of Factors Affecting Shared Bike Usage in Seoul**



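Preprocessing

1. Rename variables

2. Fix variable types

3. Handle missing values

4. Handle duplicates

5. Handle outliers

6. Clean the data

7. Map the Holiday variable to numbers

8. Create derived variables

**Load the data**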

**데이터 불러오기**

In [29]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [30]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [31]:
path0 = '/content/drive/MyDrive/에폭/SeoulBikeData.csv'

df = pd.read_csv(path0, encoding='cp949')
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(캜),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(캜),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes


### **Rename variables**

**Replace the garbled '캜' in the column names with '°C'**

In [32]:
df.rename(columns={'Temperature(캜)': 'Temperature(°C)', 'Dew point temperature(캜)': 'Dew point temperature(°C)'}, inplace=True)
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes
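
The '캜' in the original column names looks like an encoding artifact rather than real data. A minimal sketch of how it can arise, assuming (not verified against the actual file) that the CSV stores the degree sign in a single-byte Latin-1 style encoding and is then decoded as cp949; the `columns` list is a toy stand-in for `df.columns`:

```python
# Assumption: the CSV stores "°C" in a single-byte Latin-1 style encoding,
# where the degree sign is the byte 0xB0.
raw = "°C".encode("latin-1")       # two bytes: 0xB0 followed by ASCII "C"
garbled = raw.decode("cp949")      # cp949 reads the byte pair as one Hangul syllable
print(repr(garbled))

# Pattern-based repair of every affected column name (toy list, not the full frame).
columns = ["Temperature(캜)", "Dew point temperature(캜)", "Humidity(%)"]
repaired = [name.replace("캜", "°C") for name in columns]
print(repaired)
```

If that assumption holds, reading the file with a Latin-1 style encoding instead of cp949 would presumably make the rename unnecessary, though this has not been checked here.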


**Rename the Functioning Day variable**

The variable description calls this column Functional Day, so rename it to match.

In [33]:
df.rename(columns={'Functioning Day': 'Functional Day'}, inplace=True)
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functional Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes


### **Fix variable types**

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       8760 non-null   object 
 1   Rented Bike Count          8760 non-null   int64  
 2   Hour                       8760 non-null   int64  
 3   Temperature(°C)            8760 non-null   float64
 4   Humidity(%)                8760 non-null   int64  
 5   Wind speed (m/s)           8760 non-null   float64
 6   Visibility (10m)           8760 non-null   int64  
 7   Dew point temperature(°C)  8760 non-null   float64
 8   Solar Radiation (MJ/m2)    8760 non-null   float64
 9   Rainfall(mm)               8760 non-null   float64
 10  Snowfall (cm)              8760 non-null   float64
 11  Seasons                    8760 non-null   object 
 12  Holiday                    8760 non-null   object 
 13  Functional Day             8760 non-null   object 

Change the data types of the date variable (Date) and the hour variable (Hour).

**(1) Convert Date from object to datetime**

Convert Date to year-month-day form.

The raw values are in day/month/year order, so pass dayfirst=True.

In [35]:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functional Day
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes
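
The first rows show why the day/month order matters: "01/12/2017" is a valid date under either reading. A small sketch on two toy strings in the same layout (the second string is illustrative, not taken from the file), comparing `dayfirst=True` with an explicit `format`:

```python
import pandas as pd

# Toy strings in the same day/month/year layout as the Date column.
raw_dates = pd.Series(["01/12/2017", "02/12/2017"])

by_flag = pd.to_datetime(raw_dates, dayfirst=True)            # what the notebook does
explicit = pd.to_datetime(raw_dates, format="%d/%m/%Y")       # same intent, stated explicitly
month_first = pd.to_datetime(raw_dates, format="%m/%d/%Y")    # the wrong reading, for contrast

print(by_flag.dt.month.tolist())
print(month_first.dt.month.tolist())
```

An explicit `format` string fails loudly on a value that does not match, which can be preferable when every row is expected to share one layout.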


**(2) Convert Hour from int64 to object**

In [36]:
df["Hour"] = df["Hour"].astype(str)
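
Casting Hour to strings makes pandas treat it as a label rather than a quantity, but string labels sort lexicographically. The sketch below illustrates that side effect on toy values and shows an ordered categorical as one possible alternative; it is an assumption about what may be convenient later, not something the notebook does.

```python
import pandas as pd

hours = pd.Series([10, 2, 23, 0])      # toy hour values

as_text = hours.astype(str)            # same cast as the notebook
text_order = sorted(as_text.tolist())  # string comparison, not numeric
print(text_order)

# Alternative (hypothetical): keep hours as labels but preserve the 0..23 order.
as_category = pd.Series(
    pd.Categorical(hours, categories=list(range(24)), ordered=True)
)
category_order = as_category.sort_values().tolist()
print(category_order)
```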

### **Handle missing values**

In [37]:
df.isnull().sum()

Unnamed: 0,0
Date,0
Rented Bike Count,0
Hour,0
Temperature(°C),0
Humidity(%),0
Wind speed (m/s),0
Visibility (10m),0
Dew point temperature(°C),0
Solar Radiation (MJ/m2),0
Rainfall(mm),0


There are no missing values.

### **Handle duplicates**

In [38]:
df.duplicated().sum()

np.int64(0)

There are no duplicate rows.

### **Handle outliers**
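
`df.duplicated()` only flags rows that match in every column. For hourly data, a complementary check is whether any (Date, Hour) pair appears twice with different readings. A sketch on a toy frame whose values are invented for illustration:

```python
import pandas as pd

# Toy frame: the last two rows share Date and Hour but differ in the count.
toy = pd.DataFrame({
    "Date": ["2017-12-01", "2017-12-01", "2017-12-01"],
    "Hour": [0, 1, 1],
    "Rented Bike Count": [254, 204, 205],   # 205 is made up for the example
})

full_row_dupes = int(toy.duplicated().sum())                        # identical rows
key_dupes = int(toy.duplicated(subset=["Date", "Hour"]).sum())      # repeated time slots
print(full_row_dupes, key_dupes)
```

Applied to the real frame, the second call would be `df.duplicated(subset=["Date", "Hour"]).sum()`; its result is not shown in this notebook.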

In [39]:
numerical_cols = df.select_dtypes(include=["float64", "int64"])

In [40]:
numerical_cols.describe()

Unnamed: 0,Rented Bike Count,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm)
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,704.602055,12.882922,58.226256,1.724909,1436.825799,4.073813,0.569111,0.148687,0.075068
std,644.997468,11.944825,20.362413,1.0363,608.298712,13.060369,0.868746,1.128193,0.436746
min,0.0,-17.8,0.0,0.0,27.0,-30.6,0.0,0.0,0.0
25%,191.0,3.5,42.0,0.9,940.0,-4.7,0.0,0.0,0.0
50%,504.5,13.7,57.0,1.5,1698.0,5.1,0.01,0.0,0.0
75%,1065.25,22.5,74.0,2.3,2000.0,14.8,0.93,0.0,0.0
max,3556.0,39.4,98.0,7.4,2000.0,27.2,3.52,35.0,8.8
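
The summary above is where the notebook stops for outliers. One common follow-up, offered here as an assumption rather than the author's method, is Tukey's 1.5 × IQR rule. The sketch applies it to the Rented Bike Count quartiles printed in the table; `iqr_fences` is a hypothetical helper name.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of Rented Bike Count copied from the describe() output.
lower, upper = iqr_fences(191.0, 1065.25)
max_count = 3556.0                      # max from the same table
print(lower, upper, max_count > upper)
```

Under this rule the maximum lies above the upper fence, but high counts at peak hours may well be genuine demand, so flagging is not the same as deciding to drop.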


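### **Clean the data**

**Decide whether NoFunc values of Functional Day should be treated as missing**

Functional Day takes two values: NoFunc (non-functional hours) and Fun (functional hours). In this CSV they are stored as No and Yes.

In [41]:
NFD = df[df['Functional Day'] == "No"]  # show the hours when the service was not operating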
NFD

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functional Day
3144,2018-04-11,0,0,14.4,82,4.6,1041,11.3,0.0,0.0,0.0,Spring,No Holiday,No
3145,2018-04-11,0,1,13.6,81,3.6,886,10.3,0.0,0.0,0.0,Spring,No Holiday,No
3146,2018-04-11,0,2,12.7,80,3.9,885,9.3,0.0,0.0,0.0,Spring,No Holiday,No
3147,2018-04-11,0,3,11.6,81,3.1,687,8.4,0.0,0.0,0.0,Spring,No Holiday,No
3148,2018-04-11,0,4,10.2,83,3.5,554,7.4,0.0,0.0,0.0,Spring,No Holiday,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8251,2018-11-09,0,19,11.9,71,2.7,589,6.7,0.0,0.0,0.0,Autumn,No Holiday,No
8252,2018-11-09,0,20,11.9,72,2.5,526,7.0,0.0,0.0,0.0,Autumn,No Holiday,No
8253,2018-11-09,0,21,11.4,74,1.9,498,6.9,0.0,0.0,0.0,Autumn,No Holiday,No
8254,2018-11-09,0,22,11.2,75,1.7,478,6.9,0.0,0.0,0.0,Autumn,No Holiday,No


The service did not operate in 295 hourly slots over the year.

In [42]:
count_per_date = NFD.groupby('Date').size().reset_index(name='count')  # non-operating hours per date
print(count_per_date)

         Date  count
0  2018-04-11     24
1  2018-05-10     24
2  2018-09-18     24
3  2018-09-19     24
4  2018-09-28     24
5  2018-09-30     24
6  2018-10-02     24
7  2018-10-04     24
8  2018-10-06      7
9  2018-10-09     24
10 2018-11-03     24
11 2018-11-06     24
12 2018-11-09     24


In [43]:
count_per_date["DayOfWeek"] = count_per_date["Date"].dt.day_name()  # add the day of the week
print(count_per_date)

         Date  count  DayOfWeek
0  2018-04-11     24  Wednesday
1  2018-05-10     24   Thursday
2  2018-09-18     24    Tuesday
3  2018-09-19     24  Wednesday
4  2018-09-28     24     Friday
5  2018-09-30     24     Sunday
6  2018-10-02     24    Tuesday
7  2018-10-04     24   Thursday
8  2018-10-06      7   Saturday
9  2018-10-09     24    Tuesday
10 2018-11-03     24   Saturday
11 2018-11-06     24    Tuesday
12 2018-11-09     24     Friday


The non-operating days are not confined to a particular day of the week.

Mean of each weather variable on the non-operating days:
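
The weekday spread can be tallied directly. This sketch rebuilds the 13 dates from the table above and counts them per weekday:

```python
import pandas as pd

# The 13 non-operating dates listed in the output above.
dates = pd.to_datetime([
    "2018-04-11", "2018-05-10", "2018-09-18", "2018-09-19", "2018-09-28",
    "2018-09-30", "2018-10-02", "2018-10-04", "2018-10-06", "2018-10-09",
    "2018-11-03", "2018-11-06", "2018-11-09",
])

# Count how many of those dates fall on each weekday.
weekday_counts = pd.Series(dates.day_name()).value_counts()
print(weekday_counts)
```

Six of the seven weekdays appear, with Tuesday the most frequent at four dates, which fits the reading that closures do not follow a weekly schedule.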

In [44]:
FD_date_list = count_per_date["Date"].tolist()
print(FD_date_list)

[Timestamp('2018-04-11 00:00:00'), Timestamp('2018-05-10 00:00:00'), Timestamp('2018-09-18 00:00:00'), Timestamp('2018-09-19 00:00:00'), Timestamp('2018-09-28 00:00:00'), Timestamp('2018-09-30 00:00:00'), Timestamp('2018-10-02 00:00:00'), Timestamp('2018-10-04 00:00:00'), Timestamp('2018-10-06 00:00:00'), Timestamp('2018-10-09 00:00:00'), Timestamp('2018-11-03 00:00:00'), Timestamp('2018-11-06 00:00:00'), Timestamp('2018-11-09 00:00:00')]


In [45]:
df_selected = df[df["Date"].isin(FD_date_list)]

numerical_cols = df_selected.select_dtypes(include=["float64", "int64"])

df_mean = df_selected.groupby("Date")[numerical_cols.columns].mean()
df_mean


Date,Rented Bike Count,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm)
2018-04-11,0.0,12.866667,53.375,3.008333,1331.541667,2.120833,1.035833,0.0,0.0
2018-05-10,0.0,15.470833,65.833333,2.254167,1099.958333,8.7,1.105417,0.0,0.0
2018-09-18,0.0,21.85,59.666667,1.329167,1819.125,13.058333,0.660417,0.0,0.0
2018-09-19,0.0,22.204167,59.75,0.95,1729.208333,13.591667,0.444167,0.008333,0.0
2018-09-28,0.0,17.525,57.291667,1.295833,1999.333333,8.833333,0.365,0.0,0.0
2018-09-30,0.0,17.916667,53.791667,2.025,1928.25,7.6,0.70625,0.0,0.0
2018-10-02,0.0,15.783333,60.791667,1.629167,1843.708333,7.541667,0.785,0.0,0.0
2018-10-04,0.0,18.895833,56.833333,1.2875,1956.125,9.591667,0.661667,0.0,0.0
2018-10-06,668.208333,18.020833,84.166667,2.191667,1532.208333,15.0875,0.285833,2.354167,0.0
2018-10-09,0.0,15.0,52.625,1.1375,1992.583333,5.008333,0.370417,0.0,0.0


In [46]:
df.describe()  # printed to check whether the daily means above are unusual

Unnamed: 0,Date,Rented Bike Count,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm)
count,8760,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,2018-05-31 23:59:59.999999744,704.602055,12.882922,58.226256,1.724909,1436.825799,4.073813,0.569111,0.148687,0.075068
min,2017-12-01 00:00:00,0.0,-17.8,0.0,0.0,27.0,-30.6,0.0,0.0,0.0
25%,2018-03-02 00:00:00,191.0,3.5,42.0,0.9,940.0,-4.7,0.0,0.0,0.0
50%,2018-06-01 00:00:00,504.5,13.7,57.0,1.5,1698.0,5.1,0.01,0.0,0.0
75%,2018-08-31 00:00:00,1065.25,22.5,74.0,2.3,2000.0,14.8,0.93,0.0,0.0
max,2018-11-30 00:00:00,3556.0,39.4,98.0,7.4,2000.0,27.2,3.52,35.0,8.8
std,,644.997468,11.944825,20.362413,1.0363,608.298712,13.060369,0.868746,1.128193,0.436746


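The weather on these dates was not unusual: the daily means all fall within the ranges shown in the full-data summary, so the shutdowns cannot be attributed to bad weather.

The non-functional status is presumably due to maintenance or data collection errors.

**Non-functional hours can distort the analysis of shared bike usage, so we judged it appropriate to remove them.**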
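
The "not unusual" judgement can be made slightly more concrete with a rough standardized distance. The sketch below uses only temperature, with the daily means and the overall mean and standard deviation copied from the tables above. Comparing daily means against an hourly standard deviation is a coarse yardstick, and this is an illustrative check rather than the author's procedure.

```python
# Daily mean temperatures of the non-operating dates shown in the table above.
daily_mean_temp = [12.866667, 15.470833, 21.85, 22.204167, 17.525,
                   17.916667, 15.783333, 18.895833, 18.020833, 15.0]

overall_mean = 12.882922   # Temperature(°C) mean from describe()
overall_std = 11.944825    # Temperature(°C) std from describe()

# Standardized distance of each daily mean from the overall hourly mean.
z_scores = [(t - overall_mean) / overall_std for t in daily_mean_temp]
largest = max(abs(z) for z in z_scores)
print(round(largest, 2))
```

Every listed date sits within one standard deviation of the yearly mean temperature; the same comparison could be repeated for the other weather columns.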

In [47]:
df.shape

(8760, 14)

In [48]:
df = df[df["Functional Day"] != "No"].copy()  # remove non-functional hours; .copy() avoids SettingWithCopyWarning on later in-place edits

In [49]:
df.shape

(8465, 14)
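
The row counts before and after the filter can be cross-checked against the per-date table of non-operating hours shown earlier. A small arithmetic sketch using only numbers printed in this notebook:

```python
# Non-operating hours per date, copied from the count_per_date output.
hours_per_date = [24, 24, 24, 24, 24, 24, 24, 24, 7, 24, 24, 24, 24]

removed = sum(hours_per_date)        # rows where Functional Day == "No"
remaining = 8760 - removed           # 8760 rows before filtering
print(removed, remaining)
```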

In [50]:
df.drop(columns=["Functional Day"], inplace=True)  # drop the Functional Day column

In [51]:
df

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8755,2018-11-30,1003,19,4.2,34,2.6,1894,-10.3,0.0,0.0,0.0,Autumn,No Holiday
8756,2018-11-30,764,20,3.4,37,2.3,2000,-9.9,0.0,0.0,0.0,Autumn,No Holiday
8757,2018-11-30,694,21,2.6,39,0.3,1968,-9.9,0.0,0.0,0.0,Autumn,No Holiday
8758,2018-11-30,712,22,2.1,41,1.0,1859,-9.8,0.0,0.0,0.0,Autumn,No Holiday


### **Map the Holiday variable to numbers**

The cells in this section (In [53], In [54]) were run after the Day of Week column was created in In [52], so their outputs already include that column.

In [53]:
df["Holiday_dummy"] = df["Holiday"].map({"No Holiday":0, "Holiday":1})
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Day of Week,Holiday_dummy
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Friday,0
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Friday,0
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Friday,0
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Friday,0
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Friday,0


In [54]:
df.drop(columns=["Holiday"], inplace=True)
df.rename(columns={"Holiday_dummy": "Holiday"}, inplace=True)

df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Day of Week,Holiday
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,Friday,0
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,Friday,0
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,Friday,0
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,Friday,0
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,Friday,0
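
`Series.map` with a dictionary returns NaN for any label missing from the dictionary, so it is worth confirming that every row was mapped before replacing the original column. A minimal sketch on toy labels; the guard is a suggested addition, not part of the notebook:

```python
import pandas as pd

holiday = pd.Series(["No Holiday", "Holiday", "No Holiday"])   # toy labels
mapping = {"No Holiday": 0, "Holiday": 1}

encoded = holiday.map(mapping)

# Guard: an unexpected label (typo, extra whitespace) would show up as NaN here.
unmapped = int(encoded.isna().sum())
encoded = encoded.astype("int64")

# Equivalent one-liner when there are exactly two known labels.
alternative = (holiday == "Holiday").astype("int64")
print(encoded.tolist(), unmapped)
```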


### **Create derived variables**

**Create the day-of-week variable**


In [52]:
# create the day-of-week variable
df["Day of Week"] = df["Date"].dt.day_name()
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Day of Week
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Friday
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Friday
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Friday
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Friday
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Friday


**Create the weekend indicator variable**

In [56]:
df["Weekend"] = np.where(df["Day of Week"].isin(["Saturday", "Sunday"]), 1, 0)
df

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Day of Week,Holiday,Weekend
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,Friday,0,0
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,Friday,0,0
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,Friday,0,0
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,Friday,0,0
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,Friday,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8755,2018-11-30,1003,19,4.2,34,2.6,1894,-10.3,0.0,0.0,0.0,Autumn,Friday,0,0
8756,2018-11-30,764,20,3.4,37,2.3,2000,-9.9,0.0,0.0,0.0,Autumn,Friday,0,0
8757,2018-11-30,694,21,2.6,39,0.3,1968,-9.9,0.0,0.0,0.0,Autumn,Friday,0,0
8758,2018-11-30,712,22,2.1,41,1.0,1859,-9.8,0.0,0.0,0.0,Autumn,Friday,0,0
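
The weekend flag can also be derived from the numeric weekday, which avoids depending on English day names. A sketch on four consecutive toy dates starting from the first date in the data (a Friday, as shown in the outputs above), comparing both approaches:

```python
import numpy as np
import pandas as pd

# Friday through Monday, starting at the first date in the data set.
dates = pd.Series(pd.to_datetime(
    ["2017-12-01", "2017-12-02", "2017-12-03", "2017-12-04"]
))

# Notebook approach: match on day names.
by_name = np.where(dates.dt.day_name().isin(["Saturday", "Sunday"]), 1, 0)

# Alternative: dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend.
by_number = (dates.dt.dayofweek >= 5).astype(int)

print(by_name.tolist(), by_number.tolist())
```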
