Data Analysis HW2

這次作的主題是共享單車的租賃情況，利用自己設計的特徵以及decision tree作分析並且比較兩者之間的結果差異。

import library

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.externals.six import StringIO   
from sklearn.tree import export_graphviz
import pydotplus

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

利用自己設計的特徵分析

先自行猜測那些特徵會影響租車數量

1.溫度 2.風速 3.天氣

溫度小於10度風速大於20 天氣屬於第 3 4 種類都會造成租乘數量下降

meaning of weather number 1: Clear, Few clouds, Partly cloudy, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

df_guess=pd.read_csv('./train.csv')

def badFactor(row):
    if row['temp']<=10:
        val=0
    elif row['windspeed']>=20:
        val=0
    elif row['weather']==3 or row['weather']==4:
        val=0
    else:
        val=1
        
    return val

df_guess['guess']=df_guess.apply(badFactor,axis=1)

利用平均值決定當天的租車數量為高或低

mean=df_guess['count'].mean()
def amount(row):
    if row['count']>=mean:
        val=1
    else:
        val=0
    return val

df_guess['rentAmount']=df_guess.apply(amount,axis=1)

result

guess_result=df_guess['guess'].values
truth=df_guess['rentAmount'].values

from sklearn.metrics import accuracy_score
accuracy_score(truth,guess_result)

0.5306816094065773 以我自行預想的特徵篩選出來有53percent的準確度，只比隨機猜測好上一些

利用Decision Tree作分析

mapping the weather data to 0-1

weather_map={1:0,2:33,3:66,4:100}
df['weather']=df['weather'].map(weather_map)
df.head(20)

算出當日registered與casual租借用戶的比例

def ratio(row):
    if row['casual']==0:
        val=0
    else:
        val=row['registered']/row['casual']
    return val

df['registered ratio']=df.apply(ratio,axis=1)

將日期分為上班日與非上班日，消去holiday的影響

def work(row):
    if row['holiday']==1:
        val=0
    elif row['holiday']==0 and row['workingday']==0:
        val=0
    else:
        val=1
    return val

df['workingdayOrNot']=df.apply(work,axis=1)

drop the unnecessary feature

df=df.drop(['count','holiday','workingday','datetime',
            'registered','casual'],axis=1)

decision tree

#將data 分為training set 與 valid set
train=df[:8000]
valid=df[8000:]

demand=train['rentAmount'].values
train=train.drop('rentAmount',1)

dtree=DecisionTreeClassifier(max_depth=7)
dtree.fit(train,demand)

dot_data=StringIO()
export_graphviz(dtree,dot_data,filled=True,feature_names=list(train)
                ,class_names=['low demand','high demand']
                ,special_characters=True)
graph=pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf('dtree_bike.pdf')

dtree.feature_importances_

由上述此項結果可得知影響租車數量最多特徵的是體感溫度

計算準確度

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_predict)

結論

經過decision tree 演算法之後，得出的結果準確率可接近70%，比自行猜測特徵的方法準確不少

我認為兩者的結果差異可能在於沒有用到最相關的特徵，導致準確率只有五成多。

至於如果要改進利用decision tree的效能，我認為如果能將相關度較高的幾個特徵權重提高，應該能得到更好的效果。

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
hw2		hw2
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Analysis HW2

import library

利用自己設計的特徵分析

result

利用Decision Tree作分析

結論

About

Releases

Packages

Languages

steakforlife/Data-Analysis-HW2

Folders and files

Latest commit

History

Repository files navigation

Data Analysis HW2

import library

利用自己設計的特徵分析

result

利用Decision Tree作分析

結論

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages