# Reading CSV

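Commonly used parameters of `read_csv()`:

- `index_col`: which column of the CSV to use as the row labels (index).
- `header`: which row of the CSV to use as the column labels.
- `sep`: the field separator.
- `names`: the column names to assign.
- `usecols`: a list of column names or a list of column positions that selects which columns to import.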
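
To see how these parameters fit together, here is a minimal sketch that combines several of them in one call. The semicolon-separated data and the `id` column are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for a CSV file on disk
raw = "id;a;b;c\n0;61;69;38\n1;64;2;6\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                   # fields are separated by semicolons
    header=0,                  # the first line holds the column labels
    usecols=["id", "a", "c"],  # import only these columns
    index_col="id",            # use the "id" column as the row labels
)
print(df)
```

Referring to `index_col` by label rather than by position keeps the call unambiguous when `usecols` is also given.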

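## Reading an empty CSV

- Reading a completely empty file raises an error (`pandas.errors.EmptyDataError`). A file that contains only a header line yields an empty DataFrame, as the cells below show.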
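
If an empty input should not abort the script, the error can be caught. The following is a minimal sketch; the helper name `read_csv_or_empty` is hypothetical, and `io.StringIO("")` stands in for an empty file.

```python
import io
import pandas as pd
from pandas.errors import EmptyDataError


def read_csv_or_empty(source):
    """Return an empty DataFrame instead of raising on an empty input."""
    try:
        return pd.read_csv(source)
    except EmptyDataError:
        # Raised when the input contains no columns to parse
        return pd.DataFrame()


df = read_csv_or_empty(io.StringIO(""))
print(df.empty)
```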

In [7]:
import pandas as pd

# When the only line is `a,b,c`, the columns are `a, b, c`
df = pd.read_csv("ex_empty_csv_onlyheader.csv")
print(df)

Empty DataFrame
Columns: [a, b, c]
Index: []


In [9]:
import pandas as pd

# When the only line is `,a,b,c`, the columns are `Unnamed: 0, a, b, c`
df = pd.read_csv("ex_empty_csv_onlyheader.csv")
print(df)

Empty DataFrame
Columns: [Unnamed: 0, a, b, c]
Index: []


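## Specifying the row labels (using `index_col`)

First, look at the file itself. Opening `csv_write_ex.csv` shows the following (the cells below read `ex_header.csv`, which has the same content):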

    ,a,b,c
    0,61,69,38
    1,64,2,6
    2,74,70,81
    3,43,66,47
    4,54,86,70


Reading it directly gives this result:

In [2]:
import pandas as pd
import numpy as np

pd.read_csv("ex_header.csv")


   Unnamed: 0   a   b   c
0           0  61  69  38
1           1  64   2   6
2           2  74  70  81
3           3  43  66  47
4           4  54  86  70


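In other words, when `read_csv()` is called without any extra arguments it adds a new 0, 1, 2, ... row index, even though the original row labels are already stored in the first column of the file (shown as `Unnamed: 0`). Use `index_col` to take an existing column as the index instead. `index_col=0` reuses the saved labels, while the call below passes `index_col=1`, which makes column `a` the index: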

In [3]:
pd.read_csv("ex_header.csv", index_col=1)

    Unnamed: 0   b   c
a
61           0  69  38
64           1   2   6
74           2  70  81
43           3  66  47
54           4  86  70
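
The difference between the two positions can be checked side by side. This is a small sketch on an in-memory copy of the first rows of the file; the variable names are only illustrative.

```python
import io
import pandas as pd

# First rows of the example file, held in memory
raw = ",a,b,c\n0,61,69,38\n1,64,2,6\n"

# index_col=0: reuse the row labels saved in the first column
by_label = pd.read_csv(io.StringIO(raw), index_col=0)

# index_col=1: use column "a" as the index instead
by_a = pd.read_csv(io.StringIO(raw), index_col=1)

print(by_label)
print(by_a)
```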


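Be careful when `usecols` is used as well. `usecols` is applied first, so `index_col=0` then refers to the first of the selected columns (`a` here) rather than the first column of the file. For example: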

In [6]:
pd.read_csv("ex_header.csv", index_col=0, usecols=['a', 'b'])

     b
a
61  69
64   2
74  70
43  66
54  86
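
To keep the saved row labels as the index while still limiting the columns, the label column has to be part of the selection. A minimal sketch, assuming positional `usecols` on an in-memory copy of the data:

```python
import io
import pandas as pd

raw = ",a,b,c\n0,61,69,38\n1,64,2,6\n"

# Only "a" and "b" are selected, so position 0 of the selection is "a"
a_as_index = pd.read_csv(io.StringIO(raw), usecols=["a", "b"], index_col=0)

# Include the label column (position 0 in the file) in the selection,
# then use it as the index
labels_as_index = pd.read_csv(io.StringIO(raw), usecols=[0, 1, 2], index_col=0)

print(a_as_index)
print(labels_as_index)
```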


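## Specifying the column labels (using `header`)

When a CSV is read directly, the first line is used as the column labels by default (`header=0`). If the file has no header line, pass `header=None`; the first line is then treated as data.

*Note: pandas handles row labels and column labels differently by default when reading. It generates a new row index, but it takes the column labels from the first line of the file.*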

In [3]:
import pandas as pd

pd.read_csv("ex_header.csv", index_col=0, header=None)

      1   2   3
0
NaN   a   b   c
0.0  61  69  38
1.0  64   2   6
2.0  74  70  81
3.0  43  66  47
4.0  54  86  70


In [None]:
pd.read_csv("ex_header.csv", index_col=0, header=1)
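
The three `header` settings used in this section can be compared on a small in-memory copy of the data. This is only a sketch; the variable names are illustrative.

```python
import io
import pandas as pd

raw = ",a,b,c\n0,61,69,38\n1,64,2,6\n2,74,70,81\n"

# Default (header=0): line 0 supplies the column labels
default = pd.read_csv(io.StringIO(raw), index_col=0)

# header=None: every line is data, columns are numbered
no_header = pd.read_csv(io.StringIO(raw), index_col=0, header=None)

# header=1: line 1 supplies the labels, line 0 is discarded
second_line = pd.read_csv(io.StringIO(raw), index_col=0, header=1)

print(list(default.columns), len(default))
print(list(no_header.columns), len(no_header))
print(list(second_line.columns), len(second_line))
```

With `header=1` the first data line is consumed as the header, so its values become (string) column labels and one row of data is lost.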

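## Specifying the column names (using `names`)

The `names` parameter assigns the column names. It is normally used when the original CSV has no header line, so the names have to be supplied separately. In the file read below the header line is still present, so it comes back as the first data row: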

In [10]:
pd.read_csv("csv_write_ex.csv", index_col=0, names=['A', 'B', 'C'])

      A   B   C
NaN   a   b   c
0.0  61  69  38
1.0  64   2   6
2.0  74  70  81
3.0  43  66  47
4.0  54  86  70
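
When the file does have a header line that should be replaced rather than kept as data, `header=0` can be passed together with `names`. A minimal sketch, in which the name `idx` for the label column is an assumption made for the example:

```python
import io
import pandas as pd

raw = ",a,b,c\n0,61,69,38\n1,64,2,6\n"

df = pd.read_csv(
    io.StringIO(raw),
    header=0,                      # line 0 is a header and gets replaced
    names=["idx", "A", "B", "C"],  # one name per column in the file
    index_col="idx",
)
print(df)
```

Because the old header line is no longer parsed as data, the columns keep their integer dtype.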


When reading a CSV that has no header line, the first line is used as the header by default unless `names` is given.


In [1]:
import pandas as pd 

pd.read_csv("ex_noheader.csv", sep = ";", names = ["time", "buildid", "owner", "gerrit"])

               time  buildid                          owner                                             gerrit
0  2023-02-27 20:53     1831  krzysztof.wieczorek@nokia.com  https://gerrit.ext.net.nokia.com/gerrit/c/MN/5...
1  2023-02-27 19:30     1827        maciej.rekiel@nokia.com  https://gerrit.ext.net.nokia.com/gerrit/c/MN/5...


In [6]:
df = pd.read_csv("ex_noheader.csv", sep = ";", index_col=1, names = ["time", "buildid", "owner", "gerrit"])

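# Gerrit link of the first row owned by this user; `.iloc[0]` takes the first match
# (printing `.iloc` without an index would show the indexer object itself)
print(df[df['owner'] == 'krzysztof.wieczorek@nokia.com']['gerrit'].iloc[0])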
print(df.loc[1827, 'owner'])

<pandas.core.indexing._iLocIndexer object at 0x000001CB281299F0>
maciej.rekiel@nokia.com


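## Selecting specific columns (using `usecols`)

Two points need particular attention here:

1. Take care when combining it with `index_col`. When both are given, `usecols` selects the columns first and `index_col` is then applied to that selection.
2. The order of the labels passed to `usecols` is ignored. If the column order in the original CSV is `a,b,c`, then reading with `usecols=['b', 'c', 'a']` still returns the columns in the order `a,b,c`.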

In [1]:
import pandas as pd

df = pd.read_csv("ex_header.csv", usecols=['a', 'b'])
print(df)
print(df.iloc[0,1])

    a   b
0  61  69
1  64   2
2  74  70
3  43  66
4  54  86
69


In [3]:
df = pd.read_csv("ex_header.csv", usecols=['b', 'a'])
print(df)
print(df.iloc[0,1])

    a   b
0  61  69
1  64   2
2  74  70
3  43  66
4  54  86
69
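
If a particular column order is needed, one option is to reorder the columns right after reading. A minimal sketch on in-memory data; the `wanted` list is just an illustrative name:

```python
import io
import pandas as pd

raw = "a,b,c\n61,69,38\n64,2,6\n"

wanted = ["b", "a"]
# usecols limits what is parsed; indexing with the same list fixes the order
df = pd.read_csv(io.StringIO(raw), usecols=wanted)[wanted]
print(df)
```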


In [13]:
pd.read_csv("csv_write_ex.csv", index_col=0, usecols=['a', 'b'])

     b
a
61  69
64   2
74  70
43  66
54  86


You can also pass a lambda expression to `usecols` to exclude columns instead of listing the ones to keep.

```
df = pd.read_csv("f500.csv", usecols = lambda column : column not in
["company" , "rank", "revenues"])
```
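
The same idea can be tried without the `f500.csv` file. In this sketch the rows are invented placeholder data and the `profits` column is an assumption made for the example:

```python
import io
import pandas as pd

# Invented placeholder rows, not real figures
raw = "company,rank,revenues,profits\nA,1,100,10\nB,2,90,8\n"

excluded = {"company", "rank", "revenues"}
# The callable is evaluated on each column name; True means "keep"
df = pd.read_csv(io.StringIO(raw), usecols=lambda column: column not in excluded)
print(df)
```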

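## Skipping specific rows (using `skiprows`)

- `skiprows=2` skips the first two lines of the file.
- `skiprows=[0, 1, 3]` skips lines 0, 1 and 3.

Skipping works on the lines of the raw text, counted from 0 and including the header line, and it happens before the header is chosen. In the example below, `skiprows=[0, 2]` drops the original header line, so the first data line ends up as the header.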

In [7]:
import pandas as pd

df1 = pd.read_csv("ex_header.csv", index_col=0)
print(df1)
print("--------")

df2 = pd.read_csv("ex_header.csv", skiprows=2, header=None)
print(df2)
print("length of dfs = {}".format(len(df2)))
print("--------")

df3 = pd.read_csv("ex_header.csv", index_col=0, skiprows=[0,2])
print(df3)
print("--------")

    a   b   c
0  61  69  38
1  64   2   6
2  74  70  81
3  43  66  47
4  54  86  70
--------
   0   1   2   3
0  1  64   2   6
1  2  74  70  81
2  3  43  66  47
3  4  54  86  70
length of dfs = 4
--------
   61  69  38
0            
2  74  70  81
3  43  66  47
4  54  86  70
--------
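
To skip data rows while keeping the header, leave line 0 out of the list. `skiprows` also accepts a callable that receives the line number and returns `True` for lines to skip. A minimal sketch on in-memory data:

```python
import io
import pandas as pd

raw = ",a,b,c\n0,61,69,38\n1,64,2,6\n2,74,70,81\n3,43,66,47\n"

# Line 0 is the header; lines 1 and 2 are the first two data rows
skip_first_two = pd.read_csv(io.StringIO(raw), index_col=0, skiprows=[1, 2])

# Skip every odd-numbered line of the raw text (lines 1 and 3)
skip_odd = pd.read_csv(
    io.StringIO(raw), index_col=0, skiprows=lambda line_no: line_no % 2 == 1
)

print(skip_first_two)
print(skip_odd)
```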


In [15]:
import os
import pandas as pd

root_path = r"C:\N-20L6PF1F2MV8-Data\lianbche\Documents\2024\02\26-script-adpat"
abio = os.path.join(root_path, 'vdu-sharemem.csv')
abio_df = pd.read_csv(abio, usecols=['Date', 'Share Memory'])
abio_df.drop_duplicates(subset='Date', keep='first', inplace=True)
print(abio_df)

           Date  Share Memory
0    2023-08-22   116963312.0
8    2023-08-21   116963312.0
16   2023-08-18   116962928.0
28   2023-08-17   116962928.0
33   2023-08-16   115570608.0
40   2023-08-15   115570608.0
42   2023-08-14   115570608.0
45   2023-08-11   115570608.0
54   2023-08-10   115570608.0
61   2023-08-09   115570608.0
67   2023-08-08   115570608.0
76   2023-08-07   115570608.0
82   2023-08-04   115570608.0
93   2023-08-03   115570608.0
102  2023-08-02   115570384.0
111  2023-08-01   115570384.0
116  2023-07-31   115570384.0
122  2023-07-28   115570384.0
128  2023-07-27   115570384.0
135  2023-07-26   115570384.0


## Specifying column types on import

Use the `dtype` parameter.

In [12]:
my_df = pd.read_csv("ex_header.csv", index_col=0)
my_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       5 non-null      int64
 1   b       5 non-null      int64
 2   c       5 non-null      int64
dtypes: int64(3)
memory usage: 160.0 bytes


The numbers read from the CSV above default to `int64`. They can be converted to `str` while reading:

In [13]:
my_df = pd.read_csv("ex_header.csv", index_col=0, dtype={"b":"str", "c":"str"})
my_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       5 non-null      int64 
 1   b       5 non-null      object
 2   c       5 non-null      object
dtypes: int64(1), object(2)
memory usage: 160.0+ bytes
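
One common reason to read a column as `str` is to preserve leading zeros, which integer parsing drops. A small sketch; the `code` and `count` columns are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical data with zero-padded codes
raw = "code,count\n007,3\n042,5\n"

as_int = pd.read_csv(io.StringIO(raw))
as_str = pd.read_csv(io.StringIO(raw), dtype={"code": "str"})

print(as_int["code"].tolist())  # leading zeros are lost
print(as_str["code"].tolist())  # leading zeros are kept
```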


## Reading a CSV over the network

In [None]:
import requests
import pandas as pd
from base64 import b64encode
from io import StringIO

url = 'url'
my_dict = {'key':'username', 'value':'password'}

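# Method 1: send the credentials as an HTTP header through storage_options
# (the parameter was added in pandas 1.2; header forwarding for HTTP(S) URLs needs pandas 1.3 or later)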
#pd.read_csv(url,
#            storage_options={'Authorization': b'Basic %s' % b64encode(bytes('{}:{}'.format(my_dict['key'], my_dict['value']), 'utf-8'))},
#            index_col=0).tail(5000)  

# Method 2: let requests handle the authentication via `auth`
response = requests.get(url, auth=(my_dict['key'], my_dict['value']), verify=False)
df = pd.read_csv(StringIO(response.text), index_col=0)

print(df.head())
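
The two building blocks used above, the Basic authentication header and parsing downloaded text with `StringIO`, can be tried offline. In this sketch the helper `basic_auth_header` is hypothetical, and `csv_text` is a stand-in for `response.text`; no request is sent.

```python
import base64
import io
import pandas as pd


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header as a str-to-str dict."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


headers = basic_auth_header("username", "password")

# Stand-in for the text of a downloaded response
csv_text = ",a,b\n0,1,2\n1,3,4\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df)
```

A dict like `headers` is the kind of value that method 1 passes as `storage_options`; using plain strings avoids mixing bytes into the header value.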


# Writing CSV

## Reordering columns when saving a CSV

In [None]:
import pandas as pd
float_df = pd.read_csv('write_ex_3cols.csv', index_col=0)
print(float_df)

# Use `columns` to reorder the columns when writing
float_df.to_csv('write_ex_3cols_reorder.csv', columns=['b', 'c', 'a'])
pd.read_csv('write_ex_3cols_reorder.csv', index_col=0)


      a   b   c
4  30.0  30   0
1  10.0  10  10
2  20.0  20  20


    b   c     a
4  30   0  30.0
1  10  10  10.0
2  20  20  20.0
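
The text that `to_csv` produces can be inspected without touching the disk by writing to an in-memory buffer. A minimal sketch that rebuilds the frame shown above by hand:

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"a": [30.0, 10.0, 20.0], "b": [30, 10, 20], "c": [0, 10, 20]},
    index=[4, 1, 2],
)

buffer = io.StringIO()
# `columns` both selects the columns and fixes their order in the output
df.to_csv(buffer, columns=["b", "c", "a"])

lines = buffer.getvalue().splitlines()
print(lines)
```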


## Writing without the index and column labels



In [None]:
import pandas as pd
import numpy as np

my_dict = {'a':np.random.randint(1,100, 5),
           'b':np.random.randint(1,100, 5),
           'c':np.random.randint(1,100, 5),
          }

my_df = pd.DataFrame(my_dict)
print(my_df)
my_df.to_csv("write_ex_no_index.csv", columns=['b', 'c'], header=False, index=False)
pd.read_csv("write_ex_no_index.csv", header=None)

    a   b   c
0  44  87  53
1  93  70  89
2   6  55  88
3  23  84  63
4  47  82  23


    0   1
0  87  53
1  70  89
2  55  88
3  84  63
4  82  23
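
Since a file written this way carries no labels, the column names have to be supplied again when it is read back. A minimal round-trip sketch using an in-memory buffer and fixed values instead of random ones:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

buffer = io.StringIO()
# Write only columns b and c, with no header line and no index column
df.to_csv(buffer, columns=["b", "c"], header=False, index=False)
lines = buffer.getvalue().splitlines()

# Restore the labels on the way back in
restored = pd.read_csv(io.StringIO(buffer.getvalue()), header=None, names=["b", "c"])
print(lines)
print(restored)
```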
