# A Data Science Framework: To Achieve 99% Accuracy

## How a Data Scientist Beats the Odds

It's the classic problem: predict the outcome of a binary event. In layman's terms, it either occurred or it did not. For example, you won or did not win, you passed the test or did not pass, you were accepted or not accepted; you get the point. A common business application is churn, or customer retention. Another popular use case is mortality rate, or survival analysis, in healthcare. Binary events create an interesting dynamic: statistically, a random guess should achieve about 50% accuracy without building a single algorithm or writing a single line of code. However, just like autocorrect, we humans can sometimes be too smart for our own good and actually underperform a coin flip. In this kernel, I use Kaggle's Getting Started competition, Titanic: Machine Learning from Disaster, to walk the reader through how to use the data science framework to beat the odds.
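To make the coin-flip baseline concrete, here is a minimal sketch that scores purely random guesses against a synthetic binary outcome. The data is simulated and balanced by assumption; it is not the Titanic data, and the variable names are illustrative.

```python
import numpy as np

# Assumption: a synthetic, balanced binary outcome (not the Titanic data).
rng = np.random.default_rng(seed=0)
n = 100_000

y_true = rng.integers(0, 2, size=n)   # actual outcomes, 0 or 1
y_guess = rng.integers(0, 2, size=n)  # independent coin-flip guesses

# Accuracy is the share of guesses that match the actual outcome.
coin_flip_accuracy = (y_true == y_guess).mean()
print("Coin-flip accuracy: {:.3f}".format(coin_flip_accuracy))
```

When one class is more common than the other, always guessing the majority class scores above 50%, so that is usually the fairer baseline to beat.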

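## A Data Science Framework

### Define the Problem
Problem -> requirements -> solution -> design -> technology

### Gather the Data

### Data Preprocessing
This step is the required process of turning raw data into "manageable" data. It includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (i.e. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.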

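### Exploratory Analysis
Garbage-in, garbage-out. It is important to use descriptive and graphical statistics to look for potential problems, patterns, classifications, correlations, and comparisons in the dataset. In addition, data categorization (i.e. qualitative vs. quantitative) matters for understanding and selecting the correct hypothesis test or data model.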
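As a rough illustration of descriptive statistics and of the qualitative vs. quantitative split, the sketch below rebuilds a few columns from the five rows that `data_raw.head()` prints later in this kernel. Splitting by dtype is a first-pass assumption, not a rule.

```python
import pandas as pd

# A few columns from the five rows shown by data_raw.head() in this kernel.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# First-pass split by storage type: numeric vs. everything else.
quantitative = sample.select_dtypes(include="number").columns.tolist()
qualitative = sample.select_dtypes(exclude="number").columns.tolist()

print(sample.describe())             # count, mean, std, quartiles per numeric column
print(sample["Sex"].value_counts())  # frequency table for a qualitative column
print(quantitative, qualitative)
```

Note that `Survived` and `Pclass` are stored as integers but are categorical in meaning, so the dtype split still needs a human review.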

### Modeling
Choose an appropriate algorithm.

### Validate and Apply the Model
Avoid overfitting.
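One common way to spot overfitting is to compare accuracy on the training data with cross-validated accuracy. The sketch below does this on a synthetic dataset with an unpruned decision tree. The dataset, the model choice, and the 5-fold setting are assumptions made for illustration, not the method used later in this kernel.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: a synthetic, noisy binary classification problem.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)

# An unpruned tree can memorize its training data.
model = DecisionTreeClassifier(random_state=0)

train_accuracy = model.fit(X, y).score(X, y)              # scored on data it has seen
cv_accuracy = cross_val_score(model, X, y, cv=5).mean()   # scored on held-out folds

print("Train accuracy: {:.3f}".format(train_accuracy))
print("5-fold CV accuracy: {:.3f}".format(cv_accuracy))
```

A large gap between the two numbers suggests the model has learned noise in the training data rather than a pattern that generalizes.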

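### Optimization Strategy

## Step 1: Define the Problem
Which types of passengers were more likely to survive?

## Step 2: Gather the Data
Data location: H:/python/Kaggle/titanic

## Step 3: Data Preprocessing

### Load the Analysis Modules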

In [5]:
#load packages
import sys
print("Python version: {}".format(sys.version))

import pandas as pd
print("pandas version: {}".format(pd.__version__))

import matplotlib
print("matplotlib version: {}".format(matplotlib.__version__))

import numpy as np
print("NumPy version: {}".format(np.__version__))

import scipy as sp
print("SciPy version: {}".format(sp.__version__))

import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))

#misc libraries
import random
import time

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

Python version: 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)]
pandas version: 0.25.0
matplotlib version: 3.1.1
NumPy version: 1.17.0
SciPy version: 1.3.0
scikit-learn version: 0.21.3


In [3]:
#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
#from pandas.plotting import scatter_matrix

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

### Getting to Know the Data

In [6]:
data_raw = pd.read_csv('H:/python/Kaggle/titanic/train.csv')

In [9]:
data_raw.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [12]:
data_val = pd.read_csv('H:/python/Kaggle/titanic/test.csv')

In [13]:
data1 = data_raw.copy(deep = True)
data_cleaner = [data1, data_val]
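The deep copy keeps `data_raw` untouched, while the list holds references to both working DataFrames, so a single loop can clean them together in place. The sketch below shows this behavior on tiny stand-in frames. The names `raw_train`, `holdout`, `working`, and `cleaner`, and the choice to drop `Ticket`, are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv.
raw_train = pd.DataFrame({"Fare": [7.25, 71.2833], "Ticket": ["A/5 21171", "PC 17599"]})
holdout = pd.DataFrame({"Fare": [7.925], "Ticket": ["STON/O2. 3101282"]})

working = raw_train.copy(deep=True)  # independent copy; raw_train stays as loaded
cleaner = [working, holdout]         # the list stores references, not copies

# An in-place change inside the loop is visible through the original names.
for dataset in cleaner:
    dataset.drop(columns=["Ticket"], inplace=True)

print(cleaner[0] is working)
print(list(working.columns), list(holdout.columns), list(raw_train.columns))
```

Only in-place operations (or assignments to columns) carry through the list; rebinding the loop variable, as in `dataset = dataset.dropna()`, would leave the original frames unchanged.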

In [14]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
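The non-null counts above translate directly into missing-value counts. A small sketch of the arithmetic, using only the numbers printed by `info()`:

```python
import pandas as pd

# Counts copied from the data_raw.info() output above.
total_rows = 891
non_null = pd.Series({"Age": 714, "Cabin": 204, "Embarked": 889})

missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(1)

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary)
# On the loaded DataFrame, data_raw.isnull().sum() reports the same counts per column.
```

So `Age` is missing in 177 rows (about 19.9%), `Cabin` in 687 rows (about 77.1%), and `Embarked` in 2 rows (about 0.2%).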


### Data Cleaning

#### Correcting (fixing bad values)
Fix values in a variable that are obviously wrong, such as an age of 800.
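A minimal sketch of one way to flag such values: mark anything outside a plausible range as missing so it can be handled with the other missing values. The sample ages and the 0 to 120 bounds are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ages, including two obviously invalid entries (800 and -1).
ages = pd.DataFrame({"Age": [22.0, 38.0, 800.0, 26.0, -1.0]})

# Assumed plausible range for a human age; adjust to the domain.
plausible = ages["Age"].between(0, 120)
print(ages[~plausible])  # inspect the suspicious rows before changing anything

# Keep plausible values; replace the rest with NaN.
ages["Age"] = ages["Age"].where(plausible)
print(ages)
```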

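#### Completing (handling missing values)
There are two common approaches: delete the record, or fill in the missing value with a reasonable input. Deleting records is not recommended; instead, impute using the mean, the median, or the mean plus a randomized standard deviation.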
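The imputation options can be sketched as follows on a small hypothetical frame. Filling a categorical column with its mode is an added assumption, and drawing uniformly from [mean - std, mean + std] is only one possible reading of "mean plus a randomized standard deviation".

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Embarked": ["S", "C", "S", None, "S", "Q"],
})

# Option 1: median for a numeric column (robust to outliers).
age_median = df["Age"].fillna(df["Age"].median())

# Option 2 (assumption): most frequent value for a categorical column.
embarked_mode = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Option 3: random draws within one standard deviation of the mean.
rng = np.random.default_rng(seed=0)
mean, std = df["Age"].mean(), df["Age"].std()
random_ages = pd.Series(rng.uniform(mean - std, mean + std, size=len(df)), index=df.index)
age_random = df["Age"].fillna(random_ages)  # only the missing positions are filled

print(age_median.tolist())
print(embarked_mode.tolist())
print(age_random.round(1).tolist())
```

The random option keeps more of the column's spread than a single constant does, at the cost of results that depend on the seed.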

#### Creating (building new features)
Used later for feature engineering.
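As a sketch of what new features might look like, the example below derives three columns from the five rows shown by `data_raw.head()`. The feature names `FamilySize`, `IsAlone`, and `Title` are illustrative choices and are not defined in this section.

```python
import pandas as pd

# Name, SibSp and Parch for the five rows shown by data_raw.head().
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# Siblings/spouses + parents/children + the passenger themselves.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# 1 if the passenger travels without family, else 0.
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

# The title sits between the comma and the first period of the name.
after_comma = sample["Name"].str.split(", ", expand=True)[1]
sample["Title"] = after_comma.str.split(".", expand=True)[0]

print(sample[["FamilySize", "IsAlone", "Title"]])
```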

#### Converting (transforming data)
Data type conversion, such as converting a continuous variable into a categorical variable.
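Two conversions are sketched below: binning a continuous variable into categories with `pd.cut`, and encoding a text category as integers with `LabelEncoder`. The bin edges, the labels, and the column names `AgeBin` and `Sex_Code` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sex and Age for the five rows shown by data_raw.head().
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Continuous -> categorical: assumed bin edges, right-inclusive intervals.
sample["AgeBin"] = pd.cut(
    sample["Age"],
    bins=[0, 18, 30, 50, 120],
    labels=["child", "young adult", "adult", "senior"],
)

# Text category -> integer codes; classes are sorted, so female=0 and male=1.
sample["Sex_Code"] = LabelEncoder().fit_transform(sample["Sex"])

print(sample)
```

Integer codes imply an order, which is harmless for a two-level column like `Sex` but can mislead some models on columns with more levels; one-hot encoding is the usual alternative there.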