# Pandas Tutorial

### 2018 七月在线 Machine Learning Bootcamp, julyedu.com
by 褚则伟 zeweichu@gmail.com

pandas is a Python library built specifically for data analysis.

## Introduction to [Pandas](http://pandas.pydata.org/)
- A Python library for data analysis
- Built on NumPy (operations on ndarrays)
- Feels like doing Excel/SQL/R work in Python

## Contents
- Series
- DataFrame
- Index
- Reading and writing files

## The Series data structure

### Constructing and initializing a Series

In [2]:
import pandas as pd
import numpy as np

A Series is a one-dimensional data structure. Here are some ways to initialize one.

In [3]:
s = pd.Series([7, 'Beijing', 2.17, -12344, 'Happy Birthday!'])
s

0                  7
1            Beijing
2               2.17
3             -12344
4    Happy Birthday!
dtype: object

By default pandas uses 0 to n-1 as the index of a Series, but you can also specify the index yourself. You can think of the index as the keys of a dict.

In [3]:
s = pd.Series([7, 'Beijing', 2.17, -12344, 'Happy Birthday!'],
             index=['A', 'B', 'C', 'D', 'E'])
s

A                  7
B            Beijing
C               2.17
D             -12344
E    Happy Birthday!
dtype: object

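You can also build a Series from a dictionary, since a Series is essentially a set of key-value pairs. (The output below comes from an older pandas version that sorted the dictionary keys; current versions keep the dictionary's insertion order.)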

In [4]:
cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000, 'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': None}
# apts = pd.Series(cities)
apts = pd.Series(cities, name="price")
apts

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64

A Series can also be built from a NumPy ndarray.

In [5]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.782431
b   -0.643592
c   -0.600907
d    2.685830
e   -1.047606
dtype: float64

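### Selecting data

A Series can be indexed by position, much like a list. On a Series with string labels, use `.iloc` for this; plain `apts[[4, 3, 1]]` relies on a positional fallback that recent pandas versions deprecate.

In [6]:
apts.iloc[[4, 3, 1]]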

Shenzhen     50000.0
Shanghai     60000.0
Guangzhou    25000.0
Name: price, dtype: float64

In [7]:
apts[1:]

Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64

In [8]:
apts[:-1]

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Name: price, dtype: float64

Why does the following produce two NaN values?

In [9]:
apts[1:] + apts[:-1]

Beijing           NaN
Guangzhou     50000.0
Hangzhou      40000.0
Shanghai     120000.0
Shenzhen     100000.0
Suzhou            NaN
Name: price, dtype: float64
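
The two NaN values come from index alignment: arithmetic between two Series matches values by label, not by position. `Beijing` is missing from `apts[1:]`, so it has no partner, and `Suzhou` is already NaN. A minimal sketch of the same effect, using a made-up Series `s` that is not part of the tutorial data:

```python
import numpy as np
import pandas as pd

# Hypothetical data, only for illustration.
s = pd.Series([1.0, 2.0, 3.0, np.nan], index=["a", "b", "c", "d"])

left = s.iloc[1:]    # labels b, c, d
right = s.iloc[:-1]  # labels a, b, c

# Addition aligns on labels: the result covers the union a, b, c, d.
total = left + right
print(total)
# "a" exists only in `right`, and "d" exists only in `left` (and is NaN anyway),
# so both come out as NaN; "b" and "c" are each added to themselves.
```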

A Series also behaves like a dict: the index defined earlier is what you use to select data.

In [10]:
apts["Hangzhou"]

20000.0

In [11]:
apts[["Hangzhou", "Beijing", "Shenzhen"]]

Hangzhou    20000.0
Beijing     55000.0
Shenzhen    50000.0
Name: price, dtype: float64

In [12]:
"Hangzhou" in apts

True

In [13]:
"Chongqing" in apts

False

A safer way to read a value by key is `get`, which returns a default when the key is missing:

In [14]:
apts.get("Chongqing", 0)

0

The bracket form below raises a `KeyError` if the key does not exist:

In [15]:
# apts["Chongqing"]

In [16]:
apts.get("Hangzhou", 0)

20000.0

Boolean indexing works much as it does in NumPy.

In [17]:
apts[apts < 50000]

Guangzhou    25000.0
Hangzhou     20000.0
Name: price, dtype: float64

In [18]:
apts.median()

50000.0

In [19]:
apts[apts > apts.median()]

Beijing     55000.0
Shanghai    60000.0
Name: price, dtype: float64

Here is a closer look at how boolean indexing works.

In [20]:
less_than_50000 = apts < 50000
print(less_than_50000)

Beijing      False
Guangzhou     True
Hangzhou      True
Shanghai     False
Shenzhen     False
Suzhou       False
Name: price, dtype: bool


In [21]:
print(apts[less_than_50000])

Guangzhou    25000.0
Hangzhou     20000.0
Name: price, dtype: float64


### Assigning to Series elements

Elements of a Series can be assigned to.

In [22]:
print("Old value: ", apts['Shenzhen'])
apts['Shenzhen'] = 55000
print("New value: ", apts['Shenzhen'])

Old value:  50000.0
New value:  55000.0


The boolean indexing covered earlier also works for assignment.

In [23]:
print(apts[apts < 50000])
print()
apts[apts <= 50000] = 40000
print(apts[apts < 50000])

Guangzhou    25000.0
Hangzhou     20000.0
Name: price, dtype: float64

Guangzhou    40000.0
Hangzhou     40000.0
Name: price, dtype: float64


### Arithmetic

Next, some basic arithmetic operations.

In [24]:
apts / 2

Beijing      27500.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     30000.0
Shenzhen     27500.0
Suzhou           NaN
Name: price, dtype: float64

In [25]:
apts ** 2

Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Suzhou                NaN
Name: price, dtype: float64

NumPy operations can be applied to pandas objects.

In [26]:
np.square(apts)

Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Suzhou                NaN
Name: price, dtype: float64

Now define a second Series and add the two together.

In [27]:
cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000, \
                      'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64

In [28]:
print(cars + apts * 100)

Beijing      5800000.0
Chongqing          NaN
Guangzhou    4200000.0
Hangzhou           NaN
Shanghai     6400000.0
Shenzhen     5800000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64
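
Every city that appears in only one of the two Series ends up as NaN above. If treating the missing side as 0 is acceptable for your data, `Series.add` with `fill_value` is one option. A small sketch with made-up numbers (`x` and `y` are hypothetical):

```python
import pandas as pd

# Hypothetical values, only for illustration.
x = pd.Series({"Beijing": 1.0, "Shanghai": 2.0})
y = pd.Series({"Shanghai": 10.0, "Tianjin": 20.0})

plain = x + y                    # NaN wherever a label is missing on one side
filled = x.add(y, fill_value=0)  # a missing side is treated as 0 instead

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in depends on what the numbers mean; for prices it usually is not, so NaN is often the more honest result.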


### Missing data

[reference](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)

In [29]:
print('Hangzhou' in apts)
print('Hangzhou' in cars)

True
False


In [30]:
apts.notnull()

Beijing       True
Guangzhou     True
Hangzhou      True
Shanghai      True
Shenzhen      True
Suzhou       False
Name: price, dtype: bool

In [31]:
print(apts.isnull())

Beijing      False
Guangzhou    False
Hangzhou     False
Shanghai     False
Shenzhen     False
Suzhou        True
Name: price, dtype: bool


In [32]:
print(apts[apts.isnull()])

Suzhou   NaN
Name: price, dtype: float64


In [33]:
print(apts[apts.isnull() == False])

Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
Name: price, dtype: float64
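
Besides filtering with `isnull`/`notnull`, pandas offers `dropna` and `fillna` for the two most common treatments of missing values. A minimal sketch on a made-up Series (`prices` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical values, only for illustration.
prices = pd.Series({"Beijing": 55000.0, "Suzhou": np.nan})

dropped = prices.dropna()              # remove the missing entries
filled = prices.fillna(prices.mean())  # or replace them; mean() skips NaN

print(dropped)
print(filled)
```

Both calls return new objects and leave `prices` unchanged.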


## The [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) data structure

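A DataFrame is a table. Where a Series represents a one-dimensional array, a DataFrame is two-dimensional, comparable to an Excel spreadsheet. You can also think of a DataFrame as a collection of Series that share an index.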
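
To make the "collection of Series" view concrete, here is a small sketch with a made-up table (`table` is hypothetical): each column pulled out of a DataFrame is a Series carrying the same row index.

```python
import pandas as pd

# Hypothetical data, only for illustration.
table = pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]}, index=["r1", "r2"])

col = table["x"]
print(type(col).__name__)             # the column is a Series
print(col.index.equals(table.index))  # and it shares the DataFrame's row index
```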

### Creating a DataFrame

A DataFrame can be constructed from a dictionary.

In [34]:
data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
       'year': [2016,2017,2016,2017,2016, 2016],
       'population': [2100, 2300, 1000, 700, 500, 500]}
print(pd.DataFrame(data))

        city  population  year
0    Beijing        2100  2016
1   Shanghai        2300  2017
2  Guangzhou        1000  2016
3   Shenzhen         700  2017
4   Hangzhou         500  2016
5  Chongqing         500  2016


The names and order of the columns can be specified.

In [35]:
print(pd.DataFrame(data, columns=['year', 'city', 'population']))

   year       city  population
0  2016    Beijing        2100
1  2017   Shanghai        2300
2  2016  Guangzhou        1000
3  2017   Shenzhen         700
4  2016   Hangzhou         500
5  2016  Chongqing         500


In [36]:
frame = pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'],
                     index = ['one', 'two', 'three', 'four', 'five', 'six'])
print(frame)

       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN


A DataFrame can also be built from several Series.

In [37]:
apts

Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
Suzhou           NaN
Name: price, dtype: float64

In [38]:
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64

In [39]:
df = pd.DataFrame({"apts": apts, "cars": cars})
df

              apts      cars
Beijing    55000.0  300000.0
Chongqing      NaN  150000.0
Guangzhou  40000.0  200000.0
Hangzhou   40000.0       NaN
Shanghai   60000.0  400000.0
Shenzhen   55000.0  300000.0
Suzhou         NaN       NaN
Tianjin        NaN  200000.0


A DataFrame can also be built from a list of dicts.

In [40]:
data = [{"July": 999999, "Han": 50000, "Zewei": 1000}, {"July": 99999, "Han": 8000, "Zewei": 200}]
pd.DataFrame(data)

     Han    July  Zewei
0  50000  999999   1000
1   8000   99999    200


In [41]:
pd.DataFrame(data, index=["salary", "bonus"])

          Han    July  Zewei
salary  50000  999999   1000
bonus    8000   99999    200


### Selecting columns and rows

In [42]:
df["apts"]

Beijing      55000.0
Chongqing        NaN
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
Suzhou           NaN
Tianjin          NaN
Name: apts, dtype: float64

In [43]:
df["total_cost"] = df["apts"]*100 + df["cars"]
df

              apts      cars  total_cost
Beijing    55000.0  300000.0   5800000.0
Chongqing      NaN  150000.0         NaN
Guangzhou  40000.0  200000.0   4200000.0
Hangzhou   40000.0       NaN         NaN
Shanghai   60000.0  400000.0   6400000.0
Shenzhen   55000.0  300000.0   5800000.0
Suzhou         NaN       NaN         NaN
Tianjin        NaN  200000.0         NaN


In [44]:
print(frame['city'])
type(frame['city'])

one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object


pandas.core.series.Series

In [45]:
print(frame.year)
type(frame.year)

one      2016
two      2017
three    2016
four     2017
five     2016
six      2016
Name: year, dtype: int64


pandas.core.series.Series

`loc` selects a row by its label.

In [46]:
print(frame.loc['three'])
type(frame.loc['three'])

year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object


pandas.core.series.Series

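Plain bracket indexing such as `frame['city']` selects columns by default, not rows.

`iloc` selects rows and columns by integer position, so a DataFrame can be indexed like a NumPy ndarray.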

In [47]:
frame.iloc[1]

year              2017
city          Shanghai
population        2300
debt               NaN
Name: two, dtype: object

In [48]:
frame.iloc[1:3, 2:4]

       population debt
two          2300  NaN
three        1000  NaN
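
To contrast the two accessors side by side, here is a small sketch on a made-up frame (`small` is a hypothetical subset of the data above): `loc` takes labels, `iloc` takes integer positions, and both can address a single cell.

```python
import pandas as pd

# Hypothetical subset of the tutorial data, only for illustration.
small = pd.DataFrame({"year": [2016, 2017, 2016],
                      "city": ["Beijing", "Shanghai", "Guangzhou"]},
                     index=["one", "two", "three"])

by_label = small.loc["two", "city"]  # row label, column label
by_position = small.iloc[1, 1]       # row position, column position

print(by_label, by_position)
```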


### Assigning to DataFrame elements

In [49]:
frame["population"]["one"] = 2200

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [50]:
frame.loc["one", "population"] = 2200
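
The warning in cell 49 appears because `frame["population"]["one"] = ...` is chained indexing: the first bracket selects a column, which may be a copy, and the second bracket assigns into that intermediate object. Whether the original frame changes depends on the pandas version and settings. A single `.loc[row, column]` call avoids the ambiguity. A minimal sketch on a made-up frame (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical data, only for illustration.
demo = pd.DataFrame({"population": [2100, 2300]}, index=["one", "two"])

# One indexing call: pandas knows the target is `demo` itself.
demo.loc["one", "population"] = 2200
print(demo)
```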

You can assign to an entire column at once.

In [51]:
frame['debt'] = 100
print(frame)

       year       city  population  debt
one    2016    Beijing        2200   100
two    2017   Shanghai        2300   100
three  2016  Guangzhou        1000   100
four   2017   Shenzhen         700   100
five   2016   Hangzhou         500   100
six    2016  Chongqing         500   100


In [52]:
frame.loc['six'] = 0
print(frame)

       year       city  population  debt
one    2016    Beijing        2200   100
two    2017   Shanghai        2300   100
three  2016  Guangzhou        1000   100
four   2017   Shenzhen         700   100
five   2016   Hangzhou         500   100
six       0          0           0     0


In [53]:
frame
frame.index
frame['city']

one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six              0
Name: city, dtype: object

In [54]:
frame.debt = np.arange(6)
print(frame)

       year       city  population  debt
one    2016    Beijing        2200     0
two    2017   Shanghai        2300     1
three  2016  Guangzhou        1000     2
four   2017   Shenzhen         700     3
five   2016   Hangzhou         500     4
six       0          0           0     5


You can also assign a Series, which specifies the index labels to modify and their values. Labels that are not specified get NaN.

In [55]:
val = pd.Series([100, 200, 300], index=['two', 'three', 'five'])
frame['debt'] = val
print(frame)

       year       city  population   debt
one    2016    Beijing        2200    NaN
two    2017   Shanghai        2300  100.0
three  2016  Guangzhou        1000  200.0
four   2017   Shenzhen         700    NaN
five   2016   Hangzhou         500  300.0
six       0          0           0    NaN


In [56]:
frame['western'] = (frame.city == 'Chongqing')
print(frame)

       year       city  population   debt  western
one    2016    Beijing        2200    NaN    False
two    2017   Shanghai        2300  100.0    False
three  2016  Guangzhou        1000  200.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  300.0    False
six       0          0           0    NaN    False


To see which columns a DataFrame has, use `columns`.

In [57]:
print(frame.columns)

Index(['year', 'city', 'population', 'debt', 'western'], dtype='object')


The row labels are called the `index`.

In [58]:
frame.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

Like a NumPy 2D array, a DataFrame can be transposed.

In [59]:
pop = {'Beijing': {2016: 2100, 2017:2200},
      'Shanghai': {2015:2400, 2016:2500, 2017:2600}}

In [60]:
frame2 = pd.DataFrame(pop)
print(frame2)
print(frame2.T)

      Beijing  Shanghai
2015      NaN      2400
2016   2100.0      2500
2017   2200.0      2600
            2015    2016    2017
Beijing      NaN  2100.0  2200.0
Shanghai  2400.0  2500.0  2600.0


You can specify the order of the index, and initialize a DataFrame from slices of existing data.

In [61]:
print(pd.DataFrame(pop, index=[2016,2015,2017]))

      Beijing  Shanghai
2016   2100.0      2500
2015      NaN      2400
2017   2200.0      2600


In [62]:
pdata = {'Beijing': frame2['Beijing'][:-1], 'Shanghai':frame2['Shanghai'][:-1]}
print(pd.DataFrame(pdata))

      Beijing  Shanghai
2015      NaN      2400
2016   2100.0      2500


You can also name the index and the columns.

In [63]:
frame2.index.name = 'year'
frame2.columns.name = 'city'
print(frame2)

city  Beijing  Shanghai
year                   
2015      NaN      2400
2016   2100.0      2500
2017   2200.0      2600


In [64]:
print(frame2.values)
print(frame)
print(type(frame.values))

[[   nan  2400.]
 [ 2100.  2500.]
 [ 2200.  2600.]]
       year       city  population   debt  western
one    2016    Beijing        2200    NaN    False
two    2017   Shanghai        2300  100.0    False
three  2016  Guangzhou        1000  200.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  300.0    False
six       0          0           0    NaN    False
<class 'numpy.ndarray'>


In [65]:
df.values

array([[   55000.,   300000.,  5800000.],
       [      nan,   150000.,       nan],
       [   40000.,   200000.,  4200000.],
       [   40000.,       nan,       nan],
       [   60000.,   400000.,  6400000.],
       [   55000.,   300000.,  5800000.],
       [      nan,       nan,       nan],
       [      nan,   200000.,       nan]])

In [66]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
df.to_numpy()

array([[   55000.,   300000.,  5800000.],
       [      nan,   150000.,       nan],
       [   40000.,   200000.,  4200000.],
       [   40000.,       nan,       nan],
       [   60000.,   400000.,  6400000.],
       [   55000.,   300000.,  5800000.],
       [      nan,       nan,       nan],
       [      nan,   200000.,       nan]])

## Index

### index object

In [67]:
obj = pd.Series(range(3), index = ['a', 'b', 'c'])
index = obj.index
print(index)
print(index[1:])

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')


The values of an Index cannot be modified.

In [68]:
index[1] = 'd'

TypeError: Index does not support mutable operations
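
Because an Index is immutable, changing a label means building a new index rather than editing the old one. One way to do that, sketched here as an illustration rather than the only option, is `Series.rename` with a mapping:

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

try:
    obj.index[1] = "d"  # item assignment on an Index is not allowed
except TypeError as err:
    print("TypeError:", err)

# rename returns a new Series with a new Index; `obj` is left untouched.
renamed = obj.rename(index={"b": "d"})
print(renamed.index.tolist())
print(obj.index.tolist())
```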

In [None]:
index = pd.Index(np.arange(3))
obj2 = pd.Series([2,5,7], index=index)
print(obj2)
print(obj2.index is index)
print(obj2.index is np.arange(3))

In [None]:
pop = {'Beijing': {2016: 2100, 2017:2200},
      'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print('Shanghai' in frame3.columns)
print(2015 in frame3.index)

### Indexing and slicing by index

In [None]:
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
print(obj['b'])

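Positional (integer) access still works on a label-indexed Series. Use `.iloc` for it: recent pandas versions deprecate treating the integer in plain `obj[3]` as a position when the index is not numeric.

In [None]:
print(obj.iloc[3])
print()
print(obj.iloc[[1,3]])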

In [None]:
print(obj[obj<2])

Next, slicing a Series.

In [None]:
print(obj['b':'c'])
obj['b':'c'] = 5
print(obj)
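
One detail worth checking for yourself: a label slice includes both endpoints, while a positional slice excludes the stop position, as in plain Python. A minimal sketch on the same kind of Series:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

by_label = obj["b":"c"]      # label slice: both "b" and "c" are included
by_position = obj.iloc[1:2]  # positional slice: stop is excluded, only "b"

print(by_label.index.tolist())
print(by_position.index.tolist())
```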

Indexing a DataFrame works much like indexing a Series.

In [None]:
frame = pd.DataFrame(np.arange(9).reshape(3,3), 
                    index = ['a', 'c', 'd'],
                    columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])

In [None]:
print(frame)

In [None]:
print(frame['Hangzhou'])

In [None]:
print(frame[['Shenzhen', 'Nanjing']])

In [None]:
print(frame[:2])

In [None]:
print(frame.loc['a'])

In [None]:
print(frame.loc[['a','d'], ['Shenzhen', 'Nanjing']])

In [None]:
print(frame.loc[:'c', 'Hangzhou'])

A DataFrame also supports selection by condition.

In [None]:
print(frame[frame.Hangzhou > 1])

In [None]:
print(frame < 5)

In [None]:
frame[frame < 5] = 0
print(frame)

### [reindex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html)

`reindex` rearranges a Series or DataFrame to follow a new index order.

In [None]:
import numpy as np
import pandas as pd

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
print(obj)

In [None]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)

In [None]:
print(obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0))

In [None]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0,2,4])
print(obj3)

In [None]:
print(obj3.reindex(range(6), method='ffill'))

In [None]:
print(obj3.reindex(range(6), method='bfill'))
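
The cells above have no recorded output, so here is a self-contained sketch that shows what the two fill methods do: `ffill` carries the last known value forward, and `bfill` pulls the next known value backward.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

forward = obj3.reindex(range(6), method="ffill")
backward = obj3.reindex(range(6), method="bfill")

print(forward.tolist())
print(backward.tolist())
# With bfill, label 5 has no later value to borrow from, so it stays missing.
```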

Just as a Series can be reindexed, a DataFrame can be reindexed in the same way.

In [None]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)

While reindexing, you can also specify new columns.

In [None]:
print(frame.reindex(columns = ['Shenzhen', 'Hangzhou', 'Chongqing']))

In [None]:
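print(frame.reindex(index = ['a', 'b', 'c', 'd'],
                    columns = ['Chongqing', 'Hangzhou', 'Shenzhen']))
# Note: frame.loc[['a', 'b', 'c', 'd'], ['Shenzhen', 'Hangzhou', 'Chongqing']]
# raises KeyError in pandas >= 1.0 because 'b' and 'Chongqing' are missing;
# reindex is the way to introduce labels that do not exist yet.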

Next, using `drop` to remove index entries from a Series or DataFrame. Note that `drop` does not work in place: it returns a new object, and the original object is left unchanged.

In [None]:
print(obj3)
obj4 = obj3.drop(2)
print(obj4)

In [None]:
print(obj3.drop([2,4]))

In [None]:
print(frame)

In [None]:
print(frame.drop(['a', 'c']))

`drop` can remove columns as well as rows.

In [None]:
print(frame.drop('Shenzhen', axis=1))

In [None]:
print(frame.drop(['Shenzhen', 'Hangzhou'], axis=1))
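
A short, self-contained check of the "not in place" behaviour described above: the original frame keeps all its columns until you rebind the name to the result.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=["a", "c", "d"],
                     columns=["Hangzhou", "Shenzhen", "Nanjing"])

smaller = frame.drop("Shenzhen", axis=1)  # returns a new DataFrame

print(frame.columns.tolist())    # the original still has all three columns
print(smaller.columns.tolist())

# To keep the result, rebind the name.
frame = frame.drop("a")
print(frame.index.tolist())
```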

### hierarchical index

In [None]:
import numpy as np
import pandas as pd

Hierarchical indexing on a Series

In [None]:
data = pd.Series(np.random.randn(10), index=[['a','a','a','b','b','c','c','c','d','d'], [1,2,3,1,2,1,2,3,1,2]])
print(data)

In [None]:
print(data.index)

In [None]:
print(data.b)

In [None]:
print(data['b':'c'])

In [None]:
print(data[:2])

`unstack` and `stack` convert between a hierarchically indexed Series and a DataFrame.

In [None]:
print(data.unstack())
print(type(data.unstack()))

In [None]:
print(data.unstack().stack())
print(type(data.unstack().stack()))
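
Since the cells above use random numbers and have no recorded output, here is a deterministic sketch with made-up values (`data` below is hypothetical) showing the round trip: `unstack` moves the inner index level into the columns, and `stack` moves it back.

```python
import pandas as pd

# Hypothetical values, only for illustration.
data = pd.Series([1, 2, 3, 4],
                 index=[["a", "a", "b", "b"], [1, 2, 1, 2]])

wide = data.unstack()  # inner level becomes the columns: a DataFrame
back = wide.stack()    # columns return to the inner index level: a Series

print(wide)
print(type(wide).__name__, type(back).__name__)
```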

Hierarchical indexing on a DataFrame

In [None]:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index = [['a','a','b','b'], [1,2,1,2]],
                    columns = [['Beijing', 'Beijing', 'Shanghai'], ['apts', 'cars', 'apts']])
print(frame)

In [None]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['city', 'type']
print(frame)

In [None]:
print(frame.loc['a', 1])
print(type(frame.loc['a', 1]))

In [None]:
print(frame.loc['a', 2]['Beijing'])

In [None]:
print(frame.loc['a', 2]['Beijing']['apts'])

In [None]:
print(frame.loc['a'])

## Reading and writing CSV files

In [None]:
goog = pd.read_csv("data/GOOG.csv")
goog.head()

In [None]:
goog = pd.read_csv("data/GOOG.csv", index_col=0)
print(goog.head())
goog.index

In [None]:
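# Assign the parsed dates as the new index. reindex() would look up the
# datetime labels in the old string index, find none, and return all-NaN rows.
goog.index = pd.to_datetime(goog.index)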
goog.index

In [None]:
goog = pd.read_csv("data/GOOG.csv", index_col=0, parse_dates=[0])
goog.index
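
If you do not have `data/GOOG.csv` at hand, the effect of `index_col=0` together with `parse_dates=[0]` can be tried on an in-memory file. The CSV text below is made up for illustration; only the column name `Adj Close` is taken from the tutorial.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a price file; the rows are made up.
csv_text = "Date,Adj Close\n2018-01-02,10.0\n2018-01-03,11.5\n"

prices = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])

print(prices.index)  # a DatetimeIndex rather than plain strings
print(prices.loc[pd.Timestamp("2018-01-03"), "Adj Close"])
```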

In [None]:
goog.head()

In [None]:
goog.tail()

In [None]:
%matplotlib inline
goog["Adj Close"].plot()

In [None]:
df = pd.DataFrame(np.random.rand(10, 4), columns=list("abcd"))
df

In [None]:
df.to_csv("data/sample.tsv", sep="\t")
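
When writing with a non-default separator, the same `sep` has to be passed when reading the file back. A round-trip sketch using an in-memory buffer instead of a file on disk (the data is made up):

```python
import io
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

buffer = io.StringIO()
df.to_csv(buffer, sep="\t")  # tab-separated, as in the cell above
buffer.seek(0)

# index_col=0 turns the first written column back into the row index.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```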

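## Exercises

- Build three Series holding the unit price, the unit of measure, and the quantity of a set of goods. Which goods and which units you use is up to you.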

In [None]:
price = pd.Series([20, 2, 3, 50, 40],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])
unit = pd.Series(["kg", "each", "each", "each", "kg"],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])
amount = pd.Series([5, 10, 6, 1, 2],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])

- Then combine the three Series into one DataFrame.

In [None]:
fruit_df = pd.DataFrame({"price": price, "unit": unit, "amount": amount}, columns=["price", "unit", "amount"])
fruit_df

- Download some stock data from Yahoo Finance yourself, load it with `read_csv`, and draw a line chart.

In [None]:
nvda = pd.read_csv("data/NVDA.csv", index_col=0, parse_dates=[0])
nvda["Adj Close"].plot()

- The data folder contains a file named titanic.csv, which records whether each passenger survived the Titanic disaster. Read this file into a pandas DataFrame.

In [None]:
df = pd.read_csv("data/titanic.csv")
df.head()

- In the `Sex` column, change every `male` to 1 and every `female` to 0.

In [None]:
df.loc[df["Sex"] == "male", ["Sex"]] = 1
df.loc[df["Sex"] == "female", ["Sex"]] = 0
df.head()
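
An alternative to the two `loc` assignments, offered here as a sketch rather than the intended solution, is `Series.map` with a dictionary, which recodes the whole column in one step. The column below is made up:

```python
import pandas as pd

# Hypothetical column, only for illustration.
sex = pd.Series(["male", "female", "female"])

encoded = sex.map({"male": 1, "female": 0})
print(encoded.tolist())
```

Values that are not keys of the dictionary become NaN, which makes unexpected categories easy to spot.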

- Fill every NaN in `Cabin` with 0 (use the `fillna` method).

In [None]:
df.loc[:, "Cabin"] = df.loc[:, "Cabin"].fillna(0)
df.head()

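- Now group the passengers by age: ages 0-11 are class 0, 12-22 class 1, 23-33 class 2, 34-44 class 3, 45-55 class 4, 56-66 class 5, and everything else class 6. Replace `Age` with each passenger's age class. If an age is missing, fill in the mean passenger age first.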

In [None]:
df.loc[df["Age"].isnull(), "Age"] = df["Age"].mean()
df['Age'] = df['Age'].astype(int)
df.loc[ df['Age'] <= 11, 'Age'] = 0
df.loc[(df['Age'] > 11) & (df['Age'] <= 22), 'Age'] = 1
df.loc[(df['Age'] > 22) & (df['Age'] <= 33), 'Age'] = 2
df.loc[(df['Age'] > 33) & (df['Age'] <= 44), 'Age'] = 3
df.loc[(df['Age'] > 44) & (df['Age'] <= 55), 'Age'] = 4
df.loc[(df['Age'] > 55) & (df['Age'] <= 66), 'Age'] = 5
df.loc[ df['Age'] > 66, 'Age'] = 6
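
The chain of `loc` assignments above works because the classes are assigned in ascending order. As an alternative sketch (not the tutorial's own solution), `pd.cut` expresses the same binning in a single call. The ages below are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, only for illustration.
ages = pd.Series([5, 11, 12, 30, 66, 80])

bins = [-np.inf, 11, 22, 33, 44, 55, 66, np.inf]
# right=True (the default) closes each bin on the right: (11, 22], (22, 33], ...
# labels=False returns the integer bin number instead of an interval label.
age_class = pd.cut(ages, bins=bins, labels=False)
print(age_class.tolist())
```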

In [None]:
df.head()

Everything we did above falls under data preprocessing. In real machine learning problems, the data often has to be preprocessed like this to make the later modeling easier.