# What is CSV?

CSV stands for "comma-separated values" or, more loosely, "character-separated values."

- You have a bunch of records
- The records are divided into fields

In CSV, we have a text file:
- Every line in the file is a record
- Fields on each line are separated by commas

For example, a file could look like this:

```
a,b,c
d,e,f
```
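To make the record/field idea concrete, here is a minimal sketch that splits that same two-line text into records and fields using Python's standard `csv` module. The in-memory `io.StringIO` buffer stands in for a real file and is my assumption, not part of the original example.

```python
import csv
import io

# The same two-line example, held in memory instead of in a file
text = "a,b,c\nd,e,f\n"

# Each line becomes one record; each record is a list of fields
rows = list(csv.reader(io.StringIO(text)))

for record in rows:
    print(record)
```

Each record comes back as a list of strings, one string per field.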

In [2]:
import pandas as pd

In [3]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', *, sep: 'str | None | lib.NoDefault' = <no_default>, delimiter: 'str | None | lib.NoDefault' = None, header: "int | Sequence[int] | None | Literal['infer']" = 'infer', names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>, index_col: 'IndexLabel | Literal[False] | None' = None, usecols=None, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace: 'bool' = False, skiprows=None, skipfooter: 'int' = 0, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, skip_blank_lines: 'bool' = True, parse_dates: 'bool | Sequence[Hashable] | None' = None, infer_datetime_format: 'bool | lib.NoDefault' = <no_default>, keep_date_col: 'bool' = False, date_parser=<no_default>, date_format: 'str

# Argument 1: Filename or filespec

The first argument can be:
- A string describing a file
- A file-like object, with a file open for reading already
- A URL (in a string)
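As a rough sketch of the first two options, the code below reads the same small CSV text from a path, from an open file object, and from an in-memory buffer. The file name `example.csv` and its contents are made up for illustration, and the temporary directory is only there to keep the example self-contained.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical sample data, not one of the files used in this notebook
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.csv")
    with open(path, "w") as f:
        f.write(csv_text)

    # 1. A string describing a file
    df_from_path = pd.read_csv(path)

    # 2. A file-like object that is already open for reading
    with open(path) as f:
        df_from_file = pd.read_csv(f)

# An in-memory buffer is also a file-like object
df_from_buffer = pd.read_csv(io.StringIO(csv_text))

print(df_from_path.equals(df_from_file))
print(df_from_path.equals(df_from_buffer))
```

All three calls should produce the same data frame, since only the source of the text differs.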

In [4]:
# I'm going to use one of my favorite files, taxi.csv

df = pd.read_csv('taxi.csv')  

In [5]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


# By default:

- There is a line of headers at the top of the file, naming the columns
- Fields are separated by commas
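Here is a small sketch of those two defaults, using made-up data in an in-memory buffer. Spelling out `sep=','` and `header=0` should give the same result as passing nothing at all.

```python
import io

import pandas as pd

# Hypothetical data: one header line, then two records
text = "name,score\nalice,90\nbob,85\n"

# Relying on the defaults
df_default = pd.read_csv(io.StringIO(text))

# Stating the same choices explicitly
df_explicit = pd.read_csv(io.StringIO(text), sep=",", header=0)

print(list(df_default.columns))
print(df_default.shape)
```

The first line is consumed as column names, so only the remaining two lines become rows.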

In [6]:
# what if I give it a URL, rather than a filename?

url = 'https://gist.githubusercontent.com/reuven/71307c6e5dc388e9035c0bcf6548897d/raw/47c57184595ba6c528bc343ca0c498e70772e722/gistfile1.txt'
df = pd.read_csv(url)

In [7]:
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2018-06-22,46.529999,46.537498,46.174999,46.23,44.180878,108801600
1,2018-06-25,45.849998,46.23,45.182499,45.5425,43.523846,126652400
2,2018-06-26,45.747501,46.6325,45.634998,46.107498,44.063797,98276800
3,2018-06-27,46.307499,46.82,46.0075,46.040001,43.999287,101141200
4,2018-06-28,46.025002,46.552502,45.950001,46.375,44.319439,69460800


# Argument 2: `sep`

The assumption is that your separator will be `,` (comma).  What if it isn't?

You can pass any string (usually one character long) to `sep` and that will be used to separate the values.

In [8]:
filename = 'mydata3.csv'

df = pd.read_csv(filename)

In [9]:
df.head()

Unnamed: 0,1;10;20;30
0,2;40;50;60
1,3;70;80;90


In [10]:
df = pd.read_csv(filename, sep=';')
df

Unnamed: 0,1,10,20,30
0,2,40,50,60
1,3,70,80,90
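The same idea works for other separators. This is a minimal sketch with made-up tab-separated and pipe-separated text, read from in-memory buffers rather than from files.

```python
import io

import pandas as pd

# Hypothetical data using two different separators
tab_text = "a\tb\tc\n1\t2\t3\n"
pipe_text = "a|b|c\n1|2|3\n"

df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")

print(df_tab)
print(df_pipe)
```

Once the right separator is given, both inputs should parse into the same three-column data frame.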


# Argument 3: `usecols`

You often don't need every column in a data file. Reading only the columns you need saves memory and processing time.

You can do that with `usecols` -- just pass it a list of the columns you want to use, either by name or by integer position.

In [12]:
df = pd.read_csv('taxi.csv',
                 usecols=['passenger_count', 'trip_distance', 'total_amount'])
df.head()

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.63,17.8
1,1,0.46,8.3
2,1,0.87,11.0
3,1,2.13,17.16
4,1,1.4,10.3


In [13]:
df = pd.read_csv('taxi.csv')
df.columns


Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount'],
      dtype='object')

In [14]:
df = pd.read_csv('taxi.csv',
                 usecols=[3, 4, 18])
df.head()

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.63,17.8
1,1,0.46,8.3
2,1,0.87,11.0
3,1,2.13,17.16
4,1,1.4,10.3
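To check the memory claim on something small, here is a sketch with made-up data. It selects the same two columns by name and by position, then compares memory use against the full data frame. The column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data with four columns
csv_text = (
    "id,name,city,amount\n"
    "1,ann,Paris,10.5\n"
    "2,bob,Rome,20.0\n"
    "3,cy,Oslo,7.25\n"
)

full = pd.read_csv(io.StringIO(csv_text))

# The same two columns, chosen by name and by integer position
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["id", "amount"])
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 3])

# Total bytes used by each data frame, including string contents
full_bytes = full.memory_usage(deep=True).sum()
narrow_bytes = by_name.memory_usage(deep=True).sum()

print(full_bytes, narrow_bytes)
```

The exact byte counts depend on your Pandas version, but the narrower data frame should come out smaller.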


# Argument 4: `index_col`

If we already know what column (or even multiple columns) we want to use as the index in our data frame, we can specify that when we read the CSV file into memory.



In [15]:
df = pd.read_csv('taxi.csv',
                 usecols=['tpep_pickup_datetime', 'passenger_count',
                          'total_amount', 'trip_distance'],
                 index_col='tpep_pickup_datetime')

In [16]:
df.head()

tpep_pickup_datetime,passenger_count,trip_distance,total_amount
2015-06-02 11:19:29,1,1.63,17.8
2015-06-02 11:19:30,1,0.46,8.3
2015-06-02 11:19:31,1,0.87,11.0
2015-06-02 11:19:31,1,2.13,17.16
2015-06-02 11:19:32,1,1.4,10.3


In [17]:
# instead of passing a single string, I can pass a list of strings,
# and then we will have a multi-indexed data frame

df = pd.read_csv('taxi.csv',
                 usecols=['tpep_pickup_datetime', 'passenger_count',
                          'total_amount', 'trip_distance'],
                 index_col=['tpep_pickup_datetime', 'passenger_count'])
df.head()

tpep_pickup_datetime,passenger_count,trip_distance,total_amount
2015-06-02 11:19:29,1,1.63,17.8
2015-06-02 11:19:30,1,0.46,8.3
2015-06-02 11:19:31,1,0.87,11.0
2015-06-02 11:19:31,1,2.13,17.16
2015-06-02 11:19:32,1,1.4,10.3
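If you don't have `taxi.csv` handy, this sketch shows the same two cases on a few made-up rows. The column names `when`, `passengers`, and `distance` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data, loosely modeled on the taxi file
csv_text = (
    "when,passengers,distance\n"
    "2015-06-02 11:19:29,1,1.63\n"
    "2015-06-02 11:19:30,1,0.46\n"
    "2015-06-02 11:19:31,2,0.87\n"
)

# One column as the index
single = pd.read_csv(io.StringIO(csv_text), index_col="when")

# Two columns as the index, giving a multi-index
multi = pd.read_csv(io.StringIO(csv_text), index_col=["when", "passengers"])

print(single.index.name)
print(list(multi.index.names))
```

Columns used for the index no longer appear among the regular columns of the data frame.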


# Argument 5: `header`

Most CSV files tend to have the names of the columns on the first row of the file.

- Some files have junk before the headers, which we want to ignore.
- Some files don't have headers at all

We can handle both of these situations with `header`, sometimes in combination with `skiprows`.

In [18]:
df = pd.read_csv('mydata.csv')
df.head()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 4


In [19]:
!head mydata.csv

this is a bunch of stuff
here is some more stuff for you to ignore
a,b,c,d
1,10,20,30
2,40,50,60
3,70,80,90


In [20]:
# I need to tell Pandas to skip everything before the header line.
# header counts from 0, so header=2 means the headers are on the third line

df = pd.read_csv('mydata.csv', header=2)
df.head()

Unnamed: 0,a,b,c,d
0,1,10,20,30
1,2,40,50,60
2,3,70,80,90


In [21]:
# what if there are no headers at all?

!head mydata2.csv

this is a bunch of stuff
here is some more stuff for you to ignore
1,10,20,30
2,40,50,60
3,70,80,90


In [22]:
pd.read_csv('mydata2.csv', header=None)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 4


In [23]:
help(pd.read_csv)

In [24]:
df = pd.read_csv('mydata2.csv', 
                 header=None,
                 skiprows=2)

In [25]:
df.head()

Unnamed: 0,0,1,2,3
0,1,10,20,30
1,2,40,50,60
2,3,70,80,90
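Both situations can be reproduced without any files on disk. This sketch puts text shaped like `mydata.csv` and `mydata2.csv` into in-memory buffers, which is my substitution for the real files.

```python
import io

import pandas as pd

# Two junk lines, then a header line, then data
with_headers = (
    "this is a bunch of stuff\n"
    "here is some more stuff for you to ignore\n"
    "a,b,c,d\n"
    "1,10,20,30\n"
    "2,40,50,60\n"
)

# Two junk lines, then data with no header line at all
without_headers = (
    "this is a bunch of stuff\n"
    "here is some more stuff for you to ignore\n"
    "1,10,20,30\n"
    "2,40,50,60\n"
)

# header=2: the column names are on the line at index 2 (the third line)
df_named = pd.read_csv(io.StringIO(with_headers), header=2)

# skiprows=2 drops the junk; header=None says there are no column names
df_numbered = pd.read_csv(io.StringIO(without_headers), header=None, skiprows=2)

print(list(df_named.columns))
print(list(df_numbered.columns))
```

With `header=None`, Pandas falls back to integer column names starting at 0.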


# Argument 6: `names`

We can tell Pandas what names to use when we're reading in a file, if the file doesn't provide any headers itself.

In [26]:
df = pd.read_csv('mydata2.csv',
                 header=None,
                 skiprows=2,
                 names=['a', 'b', 'c', 'd'])
df

Unnamed: 0,a,b,c,d
0,1,10,20,30
1,2,40,50,60
2,3,70,80,90


In [27]:
# what if I want to read in columns a, b, and d (skipping c)
# and I want column a to be the index?

df = pd.read_csv('mydata2.csv',
                 header=None,
                 skiprows=2,
                 usecols=[0, 1, 3],
                 names=['a', 'b', 'd'],
                 index_col='a')

In [28]:
df.head()

a,b,d
1,10,30
2,40,60
3,70,90
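Another way to get a similar result, sketched here on in-memory data rather than `mydata2.csv`, is to name all four columns and then select the ones you want by name. This is an alternative to the positional `usecols` above, not a replacement for it.

```python
import io

import pandas as pd

# Headerless data, with the junk lines already left out
csv_text = "1,10,20,30\n2,40,50,60\n3,70,80,90\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["a", "b", "c", "d"],  # name every column in the file
    usecols=["a", "b", "d"],     # then keep only these, by name
    index_col="a",
)

print(df)
```

Naming every column first means the `usecols` list reads the same way as it would for a file that had its own headers.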
