## Reading CSV Files


In [1]:
import pandas as pd

df_comma = pd.read_csv('data.csv')
df_comma.head()


Unnamed: 0,ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
0,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
1,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
2,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
3,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
4,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0


In [2]:
df_semicolon = pd.read_csv('data.csv', sep=';')
df_semicolon.head()

Unnamed: 0,"ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final"
0,"27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0"
1,"30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0"
2,"39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0"
3,"28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0"
4,"27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0"
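The two cells above differ only in `sep`. The file is comma-separated, so asking for `;` finds no separator and every line ends up in a single column. A minimal sketch of the same effect, assuming a few lines copied from the sample and held in memory with `io.StringIO` instead of being read from `data.csv`:

```python
import io
import pandas as pd

# Assumption: a three-line excerpt of the sample data, kept in memory
# so the example does not depend on data.csv being present.
csv_text = "ID,Name,Final\n27604,Joe,95.0\n30572,Alex,91.0\n"

# The default separator is a comma, so three columns are found.
df_comma = pd.read_csv(io.StringIO(csv_text))

# With sep=';' no separator is found, so each line becomes one column.
df_semicolon = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df_comma.shape)              # (2, 3)
print(df_semicolon.shape)          # (2, 1)
print(list(df_semicolon.columns))  # ['ID,Name,Final']
```
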


## Changing the header index
The `header` parameter takes the zero-based index of the line to use as the header. Any lines above it are skipped.

In [3]:
df_other_header = pd.read_csv('data.csv', header=2)
df_other_header.head()


Unnamed: 0,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0.1,91.0
0,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
1,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
2,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0
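In the output above, the line for Alex became the header and the lines before it were dropped. The label `92.0.1` appears because that line contains `92.0` twice, and pandas adds a suffix to repeated column names. A small sketch of this, assuming an in-memory excerpt of the sample rather than `data.csv`:

```python
import io
import pandas as pd

# Assumption: four lines taken from the sample, reduced to four columns.
csv_text = (
    "ID,Name,Test1,Project2\n"
    "27604,Joe,87.0,93.0\n"
    "30572,Alex,92.0,92.0\n"
    "39203,Avery,68.0,90.0\n"
)

# header=2 uses the third line (index 2) as the header;
# the two lines before it are skipped.
df_other_header = pd.read_csv(io.StringIO(csv_text), header=2)

# The repeated label '92.0' gets a '.1' suffix so names stay unique.
print(list(df_other_header.columns))  # ['30572', 'Alex', '92.0', '92.0.1']
print(len(df_other_header))           # 1
```
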


We can also tell pandas that the file has no header at all by passing `header=None`:

In [4]:
df_no_header = pd.read_csv('data.csv', header=None)
df_no_header.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
1,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
2,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
3,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
4,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0


Because this file does have a header line, that line was read as the first row of data, and the columns were numbered 0 to 8 instead.
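One side effect of `header=None` on a file that has a header is that numeric columns now contain the header text too, so their values come back as strings. A minimal sketch, assuming an in-memory excerpt of the sample:

```python
import io
import pandas as pd

# Assumption: a three-line excerpt of the sample data.
csv_text = "ID,Name,Final\n27604,Joe,95.0\n30572,Alex,91.0\n"

# header=None treats every line as data and numbers the columns from 0.
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(df_no_header.columns))     # [0, 1, 2]
print(df_no_header.iloc[0].tolist())  # ['ID', 'Name', 'Final']

# The ID column mixes 'ID' with numbers, so the numbers stay as text.
print(repr(df_no_header.iloc[1, 0]))  # '27604'
```
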

## Changing header names
We can also change the header names, whether or not the file has a header line. Compare the two examples:

In [5]:
labels = ['id', 'name', 'attendance', 'hw', 'test1', 'project1', 'test2', 'project2', 'final']
df_header_label = pd.read_csv('data.csv', names=labels)
df_header_label.head()


Unnamed: 0,id,name,attendance,hw,test1,project1,test2,project2,final
0,ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
1,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
2,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
3,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
4,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0


In [6]:
df_change_header_label = pd.read_csv('data.csv', header=0, names=labels)
df_change_header_label.head()


Unnamed: 0,id,name,attendance,hw,test1,project1,test2,project2,final
0,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
1,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
2,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
3,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
4,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0
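The difference between the two cells above is whether the original header line survives as data. With `names` alone, pandas assumes the file has no header and keeps that line as the first row. Adding `header=0` says that line 0 is a header to be replaced. A short sketch of both cases, assuming an in-memory excerpt of the sample:

```python
import io
import pandas as pd

# Assumption: a three-line excerpt of the sample data.
csv_text = "ID,Name,Final\n27604,Joe,95.0\n30572,Alex,91.0\n"
labels = ['id', 'name', 'final']

# names only: the original header line is kept as a data row.
df_header_label = pd.read_csv(io.StringIO(csv_text), names=labels)

# header=0 plus names: the original header line is replaced.
df_change_header_label = pd.read_csv(
    io.StringIO(csv_text), header=0, names=labels
)

print(len(df_header_label))                   # 3
print(df_header_label.iloc[0].tolist())       # ['ID', 'Name', 'Final']
print(len(df_change_header_label))            # 2
print(df_change_header_label['id'].tolist())  # [27604, 30572]
```
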


## Index
We can also change the index. The `index_col` parameter makes one or more of our columns the index:


In [7]:
df_index_name = pd.read_csv('data.csv', index_col='Name')
df_index_name.head()


Name,ID,Attendance,HW,Test1,Project1,Test2,Project2,Final
Joe,27604,0.96,0.97,87.0,98.0,92.0,93.0,95.0
Alex,30572,1.0,0.84,92.0,89.0,94.0,92.0,91.0
Avery,39203,0.84,0.74,68.0,70.0,84.0,90.0,82.0
Kris,28592,0.96,1.0,82.0,94.0,90.0,81.0,84.0
Rick,27492,0.32,0.85,98.0,100.0,73.0,82.0,88.0


In [8]:
df_index_name_id = pd.read_csv('data.csv', index_col=['Name', 'ID'])
df_index_name_id.head()


Name,ID,Attendance,HW,Test1,Project1,Test2,Project2,Final
Joe,27604,0.96,0.97,87.0,98.0,92.0,93.0,95.0
Alex,30572,1.0,0.84,92.0,89.0,94.0,92.0,91.0
Avery,39203,0.84,0.74,68.0,70.0,84.0,90.0,82.0
Kris,28592,0.96,1.0,82.0,94.0,90.0,81.0,84.0
Rick,27492,0.32,0.85,98.0,100.0,73.0,82.0,88.0
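Once a column is the index, rows can be looked up by its values with `.loc`. Passing a list to `index_col` builds a multi-level index, which is addressed with a tuple. A minimal sketch, assuming an in-memory excerpt of the sample:

```python
import io
import pandas as pd

# Assumption: a three-line excerpt of the sample data.
csv_text = "ID,Name,Final\n27604,Joe,95.0\n30572,Alex,91.0\n"

# A single column as the index.
df_index_name = pd.read_csv(io.StringIO(csv_text), index_col='Name')
print(df_index_name.loc['Joe', 'Final'])  # 95.0

# Two columns as a multi-level index; rows are addressed with a tuple.
df_index_name_id = pd.read_csv(
    io.StringIO(csv_text), index_col=['Name', 'ID']
)
print(list(df_index_name_id.index.names))             # ['Name', 'ID']
print(df_index_name_id.loc[('Alex', 30572), 'Final']) # 91.0
```
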


## Quiz #1
Use `read_csv()` to read in `cancer_data.csv` and use an appropriate column as the index. Then, use `.head()` on your dataframe to see if you've done this correctly. *Hint: First call `read_csv()` with **only the file path** and then `head()` to see what the data looks like.*

In [11]:
df_cancer = pd.read_csv('../001-cancer-predict-dataset/data.csv', index_col='id')
df_cancer.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


## Quiz #2
Use `read_csv()` to read in `powerplant_data.csv` with more descriptive column names based on the description of features on this [website](http://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant). Then, use `.head()` on your dataframe to see if you've done this correctly. *Hint: Like in the previous quiz, first call `read_csv()` with only the file path and then `head()` to see what the data looks like.*

In [14]:
labels = ['temperature', 'exhaust_vacuum', 'ambient_pressure', 'relative_humidity', 'energy_output']
df_powerplant = pd.read_csv('../powerplantdata.csv', header=0, names=labels)
df_powerplant.head()

Unnamed: 0,temperature,exhaust_vacuum,ambient_pressure,relative_humidity,energy_output
0,8.34,40.77,1010.84,90.01,480.48
1,23.64,58.49,1011.4,74.2,445.75
2,29.74,56.9,1007.15,41.91,438.76
3,19.07,49.69,1007.22,76.79,453.09
4,11.8,40.66,1017.13,97.2,464.43


In [17]:
df_powerplant.to_csv('../powerplant_data_edited.csv', index=False)
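The last cell passes `index=False` so that the row index is not written out as an extra column. Without it, the saved file gains an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. A small sketch of the difference, assuming a two-row frame built from the values above and written to a string instead of a file:

```python
import io
import pandas as pd

# Assumption: two rows and two columns taken from the output above.
df = pd.DataFrame({
    'temperature': [8.34, 23.64],
    'energy_output': [480.48, 445.75],
})

# to_csv() returns the CSV text when no path is given.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,temperature,energy_output
print(without_index.splitlines()[0])  # temperature,energy_output

# Reading the first version back produces an extra 'Unnamed: 0' column.
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))  # ['Unnamed: 0', 'temperature', 'energy_output']
```
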
