# Pandas Tutorial: Importing Data with read_csv()

## Load your CSV file into Python with Pandas

**The first step to any data science project is to import your data.** Often, you'll work with data in comma-separated values (CSV) files and run into problems at the very start of your workflow. In this tutorial, you'll see how to use the `read_csv()` function from `pandas` to deal with common problems when importing data, and why loading flat files with `pandas` has become standard practice for working data scientists.

## The filesystem

Before you can use `pandas` to import your data, you need to know where your data file is in your filesystem and what your current working directory is. You'll see why this is important very soon, but let's review some basic concepts and Shell commands:

Everything on the computer is stored in the filesystem. 'Directories' is just another word for 'folders', and the working directory is simply the folder you're currently in. Here are some Shell commands you can use to navigate your way in the filesystem:

- The `pwd` command prints the path of your current working directory.
- The `ls` command lists all the files and sub-directories (i.e. directories inside another directory) in the current working directory.
- The `cd` command followed by the name of a sub-directory allows you to change your working directory to the sub-directory you specify.
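
If you'd rather stay inside Python, the standard library offers rough counterparts to these three commands. The sketch below is only an illustration: it builds a throwaway directory with a hypothetical `example.csv` so that it doesn't depend on the files on your machine.

```python
import os
import tempfile
from pathlib import Path

# Python counterparts of the three Shell commands (standard library only).
start_dir = Path.cwd()                      # like `pwd`
print("Working directory:", start_dir)

# Build a throwaway directory so the example does not depend on your files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data").mkdir()
    (Path(tmp) / "data" / "example.csv").write_text("a,b\n1,2\n")

    print(sorted(os.listdir(tmp)))          # like `ls`
    os.chdir(Path(tmp) / "data")            # like `cd data`
    listing = sorted(os.listdir("."))       # `ls` again, in the new directory
    print(listing)
    os.chdir(start_dir)                     # go back before the directory is deleted
```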

There's a file called `cereal.csv` that contains [nutrition data on 80 cereals](https://www.kaggle.com/crawford/80-cereals). Using the commands above, can you locate it?

Great!

Now that you know what your current working directory is and where the data is in the filesystem, you can specify the file path to it. You're now ready to import the CSV file into Python using `read_csv()` from `pandas`. Note that the `pandas` library is usually imported under the alias `pd`:

In [3]:
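import pandas as pd

# Absolute path: spelled out in full, including the working directory
cereal_df = pd.read_csv("/Users/mm82089/dc/toots/data/cereal.csv")
# Relative path: resolved against the current working directory
cereal_df2 = pd.read_csv("data/cereal.csv")

# Are they the same?
print(cereal_df.equals(cereal_df2))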


As you can see in the code chunk above, the file path is the first argument to `read_csv()`, and it was specified in two ways: as a full (absolute) path, which includes the working directory, and as a relative path. A relative path is resolved against the current working directory, which is why you needed to know where you are in the filesystem. Either way, `read_csv()` loads your flat file as a DataFrame.
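
To see how the two kinds of path relate, here is a minimal sketch that you can run anywhere. It assumes a hypothetical `data/sample.csv` created in a temporary directory, standing in for `data/cereal.csv`, and reads it once through a relative path and once through the absolute path derived from it.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

original_dir = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <tmp>/data/sample.csv stands in for data/cereal.csv.
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "sample.csv").write_text("name,calories\nA,70\nB,120\n")

    os.chdir(tmp)  # make <tmp> the current working directory
    try:
        relative_path = Path("data") / "sample.csv"
        # resolve() joins the working directory and the relative path
        absolute_path = relative_path.resolve()

        df_rel = pd.read_csv(relative_path)
        df_abs = pd.read_csv(absolute_path)
        same = df_rel.equals(df_abs)
        print(absolute_path.is_absolute(), same)
    finally:
        os.chdir(original_dir)  # restore before the temporary directory is removed
```

If you change the working directory, the relative path stops pointing at the file while the absolute path keeps working, which is the usual reason to prefer one or the other.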

Next, you'll see what else `read_csv()` can do when importing CSV files. Let's use some of the function's customizable options, particularly for the way it deals with headers, incorrect data types, and missing data.

## Dealing with headers

Headers refer to the column names of your dataset. For some datasets you might encounter, the headers may be completely missing, partially missing, or they may exist, but you may want to rename them. How do we deal with such issues effectively?

Let's take a closer look at our data:

In [22]:
import pandas as pd
df = pd.read_csv("data/cereal.csv")
print(df.head(5))

                         X.1      X.2      X.3       X.4      X.5      X.6  \
0                       name      mfr     type  calories  protein      fat   
1                  100% Bran        N        C        70        4        1   
2          100% Natural Bran        Q  no info       120        3        5   
3                   All-Bran  no info        C        70        4        1   
4  All-Bran with Extra Fiber        K        C        50        4  no info   

      X.7    X.8      X.9    X.10     X.11      X.12   X.13     X.14  X.15  \
0  sodium  fiber    carbo  sugars   potass  vitamins  shelf   weight  cups   
1     130     10  no info       6      280        25      3        1  0.33   
2      15      2        8       8      135         0      3        1     1   
3     260      9        7       5  no info        25      3        1  0.33   
4     140     14        8       0      330        25      3  no info   0.5   

        X.16  
0     rating  
1  68.402973  
2    no info  
3 

It seems like the actual column names are `name`, `mfr`, ..., `rating`, but they were imported as the first observation in the dataset, under the placeholder headers `X.1` to `X.16`! Conveniently, `read_csv()` has an argument called `skiprows` that accepts either a list of line numbers to skip (0-indexed) or the number of lines to skip at the start of the file. Here you want to skip the first line, so let's import the CSV file again with `skiprows` set to 1:

In [23]:
df = pd.read_csv("data/cereal.csv", skiprows=1)
print(df.head(5))

                        name      mfr     type  calories  protein      fat  \
0                  100% Bran        N        C        70        4        1   
1          100% Natural Bran        Q  no info       120        3        5   
2                   All-Bran  no info        C        70        4        1   
3  All-Bran with Extra Fiber        K        C        50        4  no info   
4             Almond Delight        R        C       110        2        2   

   sodium  fiber    carbo  sugars   potass  vitamins  shelf   weight  cups  \
0     130   10.0  no info       6      280        25      3        1  0.33   
1      15    2.0        8       8      135         0      3        1  1.00   
2     260    9.0        7       5  no info        25      3        1  0.33   
3     140   14.0        8       0      330        25      3  no info  0.50   
4     200    1.0       14       8       -1        25      3        1  0.75   

      rating  
0  68.402973  
1    no info  
2  59.425505  
3 

Nice!
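
The two forms of `skiprows` can be compared on a tiny in-memory file. The data below is a made-up miniature of the cereal file (a junk first line followed by the real header), and the `header=1` variant is shown only as an alternative way to reach the same result, not as something this tutorial's dataset requires.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: a junk first line, then the real header.
raw = (
    "X.1,X.2,X.3\n"
    "name,mfr,calories\n"
    "100% Bran,N,70\n"
    "All-Bran,K,70\n"
)

# An integer skips that many lines from the top of the file.
by_count = pd.read_csv(io.StringIO(raw), skiprows=1)

# A list skips specific 0-indexed line numbers.
by_index = pd.read_csv(io.StringIO(raw), skiprows=[0])

# header=1 marks line 1 (0-indexed) as the header; the line before it is dropped.
by_header = pd.read_csv(io.StringIO(raw), header=1)

print(list(by_count.columns))
print(by_count.equals(by_index), by_count.equals(by_header))
```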

Once the first line is skipped, `read_csv()` uses the first remaining line as the column names, because its `header` argument defaults to `'infer'`. Not only that, `read_csv()` infers a data type for each column of your dataset as well. For example, the `calories` column is an integer column, whereas the `fiber` column is a float column:

In [24]:
print(df['calories'].dtypes)
print(df['fiber'].dtypes)

int64
float64
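
The other header situations mentioned at the start of this section, missing headers and headers you'd like to rename, can be handled with the `header` and `names` arguments. The following is a small sketch on made-up data. The replacement names `cereal`, `manufacturer`, and `kcal` are purely illustrative.

```python
import io

import pandas as pd

# Hypothetical data with no header line at all.
no_header = "100% Bran,N,70\nAll-Bran,K,70\n"

# header=None: treat every line as data; pandas numbers the columns 0, 1, 2.
df_numbered = pd.read_csv(io.StringIO(no_header), header=None)

# names=...: supply the column names yourself.
df_named = pd.read_csv(io.StringIO(no_header), names=["name", "mfr", "calories"])

# To rename existing headers, combine names with header=0 so that the
# original header line is replaced instead of being read as data.
with_header = "name,mfr,calories\n" + no_header
df_renamed = pd.read_csv(
    io.StringIO(with_header),
    header=0,
    names=["cereal", "manufacturer", "kcal"],
)

print(list(df_numbered.columns))
print(list(df_named.columns))
print(df_renamed.dtypes)
```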


## Dealing with missing values and incorrect data types

In `pandas`, a column that contains string values is stored as type `object` by default. Here, missing values are encoded as the string `'no info'`, so even a numeric column such as `fat` ends up with a data type that isn't ideal, because its numbers are held as strings and can't be used in arithmetic directly:

In [5]:
print(df['fat'].dtypes)


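To make the problem concrete, here is a minimal sketch on a made-up three-row stand-in for the `fat` column. It checks whether pandas regards the column as numeric and then converts it after loading with `pd.to_numeric()`, which is a different technique from the load-time approach this tutorial uses. The exact dtype name printed for the string column may vary between pandas versions.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the `fat` column.
raw = "name,fat\nA,1\nB,no info\nC,5\n"
df_small = pd.read_csv(io.StringIO(raw))

print(df_small["fat"].dtype)                 # dtype inferred for the mixed column
values = df_small["fat"].tolist()            # every value is a string, even '1' and '5'
print(values)

# pandas does not consider this column numeric.
is_numeric = pd.api.types.is_numeric_dtype(df_small["fat"])
print(is_numeric)

# Converting after the fact: unparseable entries become NaN.
coerced = pd.to_numeric(df_small["fat"], errors="coerce")
print(coerced.dtype, coerced.isna().sum())
```
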
The `fat` column should be numeric, and missing data should be encoded as `NaN`. Instead of going through each column and replacing `'no info'` with `NaN` after the data is loaded, you can pass the `na_values` argument so that the replacement happens while the file is read. Note that `NaN` is a floating-point value, so a numeric column with missing values is stored as `float64` rather than `int64`:

In [26]:
df = pd.read_csv("data/cereal.csv", skiprows=1, na_values='no info')
print(df.head(5))

                        name  mfr type  calories  protein  fat  sodium  fiber  \
0                  100% Bran    N    C        70        4  1.0     130   10.0   
1          100% Natural Bran    Q  NaN       120        3  5.0      15    2.0   
2                   All-Bran  NaN    C        70        4  1.0     260    9.0   
3  All-Bran with Extra Fiber    K    C        50        4  NaN     140   14.0   
4             Almond Delight    R    C       110        2  2.0     200    1.0   

   carbo  sugars  potass  vitamins  shelf  weight  cups     rating  
0    NaN       6   280.0        25      3     1.0  0.33  68.402973  
1    8.0       8   135.0         0      3     1.0  1.00        NaN  
2    7.0       5     NaN        25      3     1.0  0.33  59.425505  
3    8.0       0   330.0        25      3     NaN  0.50  93.704912  
4   14.0       8    -1.0        25      3     1.0  0.75  34.384843  


Awesome. Now our data looks clean and all it took was one line of code with `pandas`'s `read_csv()` function.
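
As a quick check of what `na_values` does, the sketch below uses a small made-up table and counts the missing values per column. It also shows the dictionary form of `na_values`, which limits a marker to particular columns. That variant isn't needed for the cereal data and is included only for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature table that uses 'no info' as its missing-value marker.
raw = (
    "name,mfr,fat,rating\n"
    "A,N,1,68.4\n"
    "B,no info,5,no info\n"
    "C,K,no info,59.4\n"
)

# One marker applied to every column.
df_all = pd.read_csv(io.StringIO(raw), na_values="no info")

# A dict restricts the marker to the listed columns; `mfr` keeps its string.
df_some = pd.read_csv(
    io.StringIO(raw),
    na_values={"fat": ["no info"], "rating": ["no info"]},
)

print(df_all.isna().sum())      # missing values per column
print(df_all["fat"].dtype)      # numeric column with NaN is stored as float
print(df_some["mfr"].tolist())
```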

## Final thoughts
There are other types of data files, such as SAS, Excel, and Stata files, that you can import into Python using `pandas`. You can learn the best practices for importing all kinds of data into Python in the [Importing Data in Python](https://www.datacamp.com/courses/importing-data-in-python-part-1) course series on DataCamp. Happy learning!

In [4]:
import pandas as pd
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=No