<a href="https://colab.research.google.com/github/villafue/Progamming/blob/main/Python/Tutorial/Pandas/pandas%20Foundations/1%20Data%20ingestion%20%26%20inspection/1_Data_ingestion_%26_inspection_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data ingestion & inspection 

In this chapter, you will be introduced to pandas DataFrames. You will use pandas to import and inspect a variety of datasets, ranging from population data obtained from the World Bank to monthly stock data obtained via Yahoo Finance. You will also practice building DataFrames from scratch and become familiar with the intrinsic data visualization capabilities of pandas. 

# Review of pandas DataFrames

1. Review of pandas DataFrames

Let's learn how to get data in and look at it. First, we need to recall a few things about pandas. pandas is a library for data analysis, and its power tool is the DataFrame, a tabular data structure with labeled rows and columns.
2. pandas DataFrames

As an example, we'll use a DataFrame with Apple stock data. The rows are labeled by a special data structure called an Index (we'll learn more about Indexes later). Indexes in pandas are tailored lists of labels that permit fast look-up and some powerful relational operations. The index labels in the aapl DataFrame are dates in reverse chronological order. Labeled rows and columns improve the clarity and intuition of many data analysis tasks.
3. Indexes and columns

When we ask for the type of the object aapl, it's a DataFrame. When we ask for its shape, it has 8514 rows and 6 columns. The DataFrame columns attribute gives the names of its columns (Open, High, Low, Close, Volume, and Adjusted Close). Notice that aapl.columns is also a pandas Index.
4. Indexes and columns

Actually, the aapl.index attribute in this case is a special kind of Index called a DatetimeIndex. We'll study DatetimeIndexes and time series later.
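As a rough illustration of these attributes, the sketch below builds a tiny stand-in for aapl. The three rows and their prices are invented for the example (the real DataFrame has 8514 rows), so only the kinds of objects returned matter here.

```python
import pandas as pd

# Hypothetical stand-in for the aapl DataFrame; the prices are invented.
aapl = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.0], "Close": [10.5, 11.5, 12.5]},
    index=pd.to_datetime(["2014-09-16", "2014-09-15", "2014-09-12"]),
)

print(type(aapl))        # the DataFrame class
print(aapl.shape)        # (number of rows, number of columns)
print(aapl.columns)      # column labels, stored in an Index
print(type(aapl.index))  # dates as labels give a DatetimeIndex
```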
5. Slicing

DataFrames can be sliced like NumPy arrays or Python lists, using colons to specify the start, end, and stride of a slice. First, we can slice from the start of the DataFrame up to (but not including) the row at position 5 using the .iloc accessor, which expresses the slice positionally; this returns the first five rows. Second, we can slice from the 5th-last row to the end of the DataFrame using a negative index. Remember, it's also possible to slice using labels with the .loc accessor.
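The slices described above might look like the following sketch. The ten-row DataFrame is a made-up stand-in for aapl, and its dates run in ascending order for simplicity, unlike the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for aapl: ten days of invented values.
dates = pd.date_range("2015-01-01", periods=10, freq="D")
aapl = pd.DataFrame(
    {"Close": np.arange(100.0, 110.0), "Volume": np.arange(10) * 1000},
    index=dates,
)

first_five = aapl.iloc[:5, :]   # positions 0 through 4; position 5 is excluded
last_five = aapl.iloc[-5:, :]   # from the 5th-last row to the end
# Label-based slices with .loc include both end points.
by_label = aapl.loc["2015-01-02":"2015-01-04", "Close"]

print(first_five)
print(last_five)
print(by_label)
```

Note the asymmetry: positional slices with .iloc exclude the stop position, while label slices with .loc include the stop label.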
6. head()

There's another way to see just the top rows of a DataFrame: the head method. Specifying head(5) returns the first 5 rows. Specifying head(2) returns the first 2 rows. The head method is particularly useful here because our DataFrame has over 8000 rows. The opposite of head is tail.
7. tail()

Specifying tail without an argument returns the last 5 rows by default. Specifying tail(3) returns the last 3 rows. Again, tail gives a useful summary of a large DataFrame. Another useful summary method is info.
8. info()

The info method prints other useful summary information, including the kind of Index, the column labels, the number of rows and columns, and the datatype of each column.
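The three inspection methods can be tried on any DataFrame. The sketch below uses a small invented DataFrame of 20 rows in place of the stock data.

```python
import numpy as np
import pandas as pd

# Invented data: 20 rows, two numeric columns.
df = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) * 0.5})

top = df.head()        # first 5 rows by default
top2 = df.head(2)      # first 2 rows
bottom3 = df.tail(3)   # last 3 rows

print(top)
print(bottom3)
df.info()              # prints index type, column dtypes, and non-null counts
```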
9. Broadcasting

Pandas DataFrame slices also support broadcasting (we'll learn more about this later). Here, a slice is assigned a scalar value (in this case, nan or Not a Number). The slice consists of every third row starting from zero in the last column. We can call head(6) to see the changes.
10. Broadcasting

We can also call info and notice the last column has fewer non-null entries than the others due to our assigning nan to every third element.
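A minimal sketch of this broadcast assignment, assuming a made-up nine-row DataFrame in place of aapl:

```python
import numpy as np
import pandas as pd

# Invented stand-in for aapl with nine rows.
aapl = pd.DataFrame(
    {"Close": np.arange(100.0, 109.0), "Adj Close": np.arange(50.0, 59.0)}
)

# Broadcast the scalar NaN to every third row (0, 3, 6, ...) of the last column.
aapl.iloc[::3, -1] = np.nan

print(aapl.head(6))
aapl.info()   # the last column now reports fewer non-null entries
```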
11. Series

The columns of a DataFrame are themselves a specialized Pandas structure called a Series. Extracting a single column from a DataFrame returns a Series. Notice the Series extracted has its own head method and inherits its name attribute from the DataFrame column. To extract the numerical entries from the Series, use the values attribute. The data in the Series actually form a NumPy array which is what the values attribute yields. A Pandas Series, then, is a 1D labelled NumPy array and a DataFrame is a 2D labelled array whose columns are Series.
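To make the relationship concrete, here is a short sketch with an invented three-row DataFrame; the column names and values are placeholders.

```python
import numpy as np
import pandas as pd

# Invented data standing in for two columns of aapl.
aapl = pd.DataFrame({"Low": [1.0, 2.0, 3.0], "High": [2.0, 3.0, 4.0]})

low = aapl["Low"]    # extracting one column gives a Series
print(type(low))
print(low.name)      # the name is inherited from the column label
print(low.head())

lows = low.values    # the underlying NumPy array
print(type(lows))
```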
12. Let's practice!

We've seen a few concepts that extend what we already knew, including head, tail, info, index, values, and Series. Take some time to practice using these concepts in the exercises.

# Inspecting your data

You can use the DataFrame methods .head() and .tail() to view the first few and last few rows of a DataFrame. In this exercise, we have imported pandas as pd and loaded population data from 1960 to 2014 as a DataFrame df. This dataset was obtained from the [World Bank](http://databank.worldbank.org/data/reports.aspx?source=2&type=metadata&series=SP.URB.TOTL.IN.ZS#).

Your job is to use df.head() and df.tail() to verify that the first and last rows match a file on disk. In later exercises, you will see how to extract values from DataFrames with indexing, but for now, manually copy/paste or type values into assignment statements where needed. Select the correct answer for the first and last values in the 'Year' and 'Total Population' columns.

```
In [4]:
df[['Year', 'Total Population']].head(1)
Out[4]:

   Year  Total Population
0  1960        92495902.0
In [5]:
df[['Year', 'Total Population']].tail(1)
Out[5]:

       Year  Total Population
13373  2014        15245855.0
```

Possible Answers

1. First: 1980, 26183676.0; Last: 2000, 35.
    - Incorrect. The first and last years are not 1980 and 2000.

2. First: 1960, 92495902.0; Last: 2014, 15245855.0.
    - Great work! It's essential to inspect your data like this after you read it in.

3. First: 40.472, 2001; Last: 44.5, 1880.
    - Incorrect. The last year is not 1880.

4. First: CSS, 104170.0; Last: USA, 95.203.
    - Incorrect. The Country Code column is not relevant here.

# DataFrame data types

Pandas is aware of the data types in the columns of your DataFrame. It is also aware of null and NaN ('Not-a-Number') values, which often indicate missing data. In this exercise, we have imported pandas as pd and read the world population data into a DataFrame df that contains some NaN values, which are often used as placeholders for missing or otherwise invalid data entries.

Your job is to use df.info() to determine information about the total count of non-null entries and infer the total count of null entries, which likely indicates missing data.

Select the best description of this data set from the following:

```
In [1]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13374 entries, 0 to 13373
Data columns (total 5 columns):
CountryName                      13374 non-null object
CountryCode                      13374 non-null object
Year                             13374 non-null int64
Total Population                 9914 non-null float64
Urban population (% of total)    13374 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 522.5+ KB
```

Possible Answers

1. The data is all of type float64 and none of it is missing.
 - Incorrect. Notice that there are also int64 and object types.

2. The data is of mixed type, and 9914 of it is missing.
 - Incorrect. The data is indeed of mixed type, but a different number of entries are missing.

3. The data is of mixed type, and 3460 float64s are missing.
 - Correct. The data is of mixed type, and the Total Population column has 13374 - 9914 = 3460 missing float64 entries.

4. The data is all of type float64, and 3460 float64s are missing.
 - Incorrect. While there are indeed 3460 float64 entries missing, not all of the data is of type float64.
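The count of missing entries can also be computed rather than read off the info output. The sketch below uses a tiny invented DataFrame, not the World Bank data; on the real df the same subtraction would give 13374 - 9914 = 3460 for Total Population.

```python
import numpy as np
import pandas as pd

# Invented four-row example with two missing population values.
df = pd.DataFrame(
    {
        "Year": [1960, 1961, 1962, 1963],
        "Total Population": [1.0, np.nan, 3.0, np.nan],
    }
)

# count() gives non-null entries per column, so the difference is the null count.
missing = len(df) - df.count()
print(missing)
```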

# NumPy and pandas working together

Pandas depends upon and interoperates with NumPy, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute .values to represent a DataFrame df as a NumPy array. You can also pass pandas data structures to NumPy methods. In this exercise, we have imported pandas as pd and loaded world population data every 10 years since 1960 into the DataFrame df. This dataset was derived from the one used in the previous exercise.

Your job is to extract the values and store them in an array using the attribute .values. You'll then use those values as input into the NumPy np.log10() method to compute the base 10 logarithm of the population values. Finally, you will pass the entire pandas DataFrame into the same NumPy np.log10() method and compare the results.

Instructions

1. Import numpy using the standard alias np.

2. Assign the numerical values in the DataFrame df to an array np_vals using the attribute values.

3. Pass np_vals into the NumPy method log10() and store the results in np_vals_log10.

4. Pass the entire df DataFrame into the NumPy method log10() and store the results in df_log10.

5. Inspect the output of the print() code to see the type() of the variables that you created.


```python
# Import numpy
import numpy as np

# Create array of DataFrame values: np_vals
np_vals = df.values

# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

# Create new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)

# Print original and new data containers
for name in ['np_vals', 'np_vals_log10', 'df', 'df_log10']:
    print(name, 'has type', type(eval(name)))

'''
<script.py> output:
    np_vals has type <class 'numpy.ndarray'>
    np_vals_log10 has type <class 'numpy.ndarray'>
    df has type <class 'pandas.core.frame.DataFrame'>
    df_log10 has type <class 'pandas.core.frame.DataFrame'>
'''
```

Conclusion

Wonderful work! As a data scientist, you'll frequently interact with NumPy arrays, pandas Series, and pandas DataFrames, and you'll leverage a variety of NumPy and pandas methods to perform your desired computations. Understanding how NumPy and pandas work together will prove to be very useful.

# Building DataFrames from scratch

1. Building DataFrames from scratch

We've seen how to work with DataFrames in memory. But how do we get them in memory?
2. DataFrames from CSV files

In the Intermediate Python for Data Science course, we used read_csv to load a DataFrame from a comma-separated-values file. For instance, here we use a file users.csv to create a DataFrame called users. The file records visitors to a band's blog and how many of them signed up for the newsletter. Tracking where visitors come from can help plan tours later.
3. DataFrames from dict (1)

DataFrames can also be rolled by hand using dictionaries. Remember, dictionaries (or associative arrays) are a core data structure in Python. Here, we construct a dictionary of lists with the same users data. The keys of the dictionary data are used as column labels. Notice, with no index specified, the row labels are the integers zero to three by default.
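A minimal sketch of this construction follows. The column names mirror the users data described here, but every value is invented for illustration.

```python
import pandas as pd

# Dictionary of lists; all values below are invented placeholders.
data = {
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
}

users = pd.DataFrame(data)   # keys become column labels
print(users)
print(users.index)           # default integer row labels
```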
4. DataFrames from dict (2)

Let's build the DataFrame users up a different way, using conforming lists cities, signups, visitors and weekdays for the column data. It is useful to be able to build DataFrames from lists because lists are a common Python data structure; it's natural that we might receive data accumulated in lists. We can then define two other lists: list_labels (containing the column labels) and list_cols (containing the column entries for each column). Notice list_cols is a list of lists. Applying Python's zip and list functions to list_labels and list_cols produces zipped, a list of (column name, column) tuples to feed to the dict command.
5. DataFrames from dict (3)

Calling dict(zipped) creates a dict data, which is then passed to pd.DataFrame to build the DataFrame.
6. Broadcasting

Let's look again at broadcasting, a convenient technique in NumPy & Pandas. With users in memory, a new column, say fees, can be created on the fly. By using the new column label fees and by assigning the scalar value zero, the value is broadcast to the entire column. Broadcasting saves time in generating long lists, arrays, or columns.
7. Broadcasting with a dict

Broadcasting is not restricted to numbers. Here, we create a dictionary data with column labels height and sex as keys and a list and a single-character string 'M' as values. When the dict data is used to create DataFrame results, the value 'M' is broadcast to the entire column.
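Both forms of broadcasting can be sketched as follows. The users rows and the height values are invented; only the column labels fees, height, and sex come from the description above.

```python
import pandas as pd

# Invented two-row stand-in for the users DataFrame.
users = pd.DataFrame({"city": ["Austin", "Dallas"], "visitors": [139, 237]})

# Assigning a scalar to a new column label broadcasts it to every row.
users["fees"] = 0
print(users)

# Broadcasting inside a dict: the single string 'M' fills the whole column.
heights = [59.0, 65.2, 62.9]   # invented measurements
data = {"height": heights, "sex": "M"}
results = pd.DataFrame(data)
print(results)
```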
8. Index and columns

Remember, we can change the column and index labels using the columns and index attributes of a Pandas DataFrame. We can assign lists of strings to the attributes columns and index as long as they are of suitable length (that is, the number of columns and rows respectively).
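As a sketch of this re-labeling, the example below uses an invented three-row DataFrame with placeholder labels. It also shows what happens when the list length does not match.

```python
import pandas as pd

# Invented data; the new labels below are placeholders.
results = pd.DataFrame({"height": [59.0, 65.2, 62.9], "sex": "M"})

results.columns = ["height (in)", "sex"]   # one label per column
results.index = ["A", "B", "C"]            # one label per row
print(results)

# A list of the wrong length is rejected.
try:
    results.index = ["A", "B"]
    length_error = None
except ValueError as err:
    length_error = err
    print("Rejected:", err)
```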
9. Let's practice!

It's time for you to practice using other DataFrame construction techniques, broadcasting, and re-labeling. 

# Zip lists to build a DataFrame

In this exercise, you're going to make a pandas DataFrame of the top three countries to win gold medals since 1896 by first building a dictionary. list_keys contains the column names 'Country' and 'Total'. list_values contains the full names of each country and the number of gold medals awarded. The values have been taken from [Wikipedia](https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table).

Your job is to use these lists to construct a list of tuples, use the list of tuples to construct a dictionary, and then use that dictionary to construct a DataFrame. In doing so, you'll make use of the list(), zip(), dict() and pd.DataFrame() functions. Pandas has already been imported as pd.

Note: The zip() function in Python 3 and above returns a special zip object, which is a lazy iterator that behaves much like a generator. To convert this zip object into a list, you'll need to use list(). You can learn more about the zip() function as well as generators in [Python Data Science Toolbox (Part 2)](https://www.datacamp.com/courses/python-data-science-toolbox-part-2).

Instructions

1. Zip the 2 lists list_keys and list_values together into one list of (key, value) tuples. Be sure to convert the zip object into a list, and store the result in zipped.

2. Inspect the contents of zipped using print(). This has been done for you.

3. Construct a dictionary using zipped. Store the result as data.

4. Construct a DataFrame using the dictionary. Store the result as df.


```python
# Zip the 2 lists together into one list of (key,value) tuples: zipped
zipped = list(zip(list_keys, list_values))

# Inspect the list using print()
print(zipped)

# Build a dictionary with the zipped list: data
data = dict(zipped)

# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)

'''
<script.py> output:
    [('Country', ['United States', 'Soviet Union', 'United Kingdom']), ('Total', [1118, 473, 273])]

----------------------------------

              Country  Total
    0   United States   1118
    1    Soviet Union    473
    2  United Kingdom    273
'''
```

Conclusion

Fantastic! Being able to build DataFrames from scratch is an important skill.

# Labeling your data

You can use the DataFrame attribute df.columns to view and assign new string labels to columns in a pandas DataFrame.

In this exercise, we have imported pandas as pd and defined a DataFrame df containing top Billboard hits from the 1980s (from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_the_1980s#1980)). Each row has the year, artist, song name and the number of weeks at the top. However, this DataFrame has the column labels a, b, c, d. Your job is to use the df.columns attribute to re-assign descriptive column labels.

Instructions

1. Create a list of new column labels with 'year', 'artist', 'song', 'chart weeks', and assign it to list_labels.

2. Assign your list of labels to df.columns.


```python
# Build a list of labels: list_labels
list_labels = ['year', 'artist', 'song', 'chart weeks']

# Assign the list of labels to the columns attribute: df.columns
df.columns = list_labels
```

Conclusion

Great work! You'll often need to rename column names like this to be more informative.

# Building DataFrames with broadcasting

You can implicitly use 'broadcasting', a feature of NumPy, when creating pandas DataFrames. In this exercise, you're going to create a DataFrame of cities in Pennsylvania that contains the city name in one column and the state name in the second. We have imported the names of 15 cities as the list cities.

Your job is to construct a DataFrame from the list of cities and the string 'PA'.

Instructions

1. Make a string object with the value 'PA' and assign it to state.

2. Construct a dictionary with 2 key:value pairs: 'state':state and 'city':cities.

3. Construct a pandas DataFrame from the dictionary you created and assign it to df.


```python
# Make a string with the value 'PA': state
state = 'PA'

# Construct a dictionary: data
data = {'state':state, 'city':cities}

# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

'''
<script.py> output:
       state             city
    0     PA          Manheim
    1     PA     Preston park
    2     PA      Biglerville
    3     PA          Indiana
    4     PA     Curwensville
    5     PA            Crown
    6     PA     Harveys lake
    7     PA  Mineral springs
    8     PA        Cassville
    9     PA       Hannastown
    10    PA        Saltsburg
    11    PA      Tunkhannock
    12    PA       Pittsburgh
    13    PA        Lemasters
    14    PA       Great bend
'''
```

Conclusion

Excellent job! Broadcasting is a powerful technique.

# Importing & exporting data

1. Importing & exporting data

Now, let's extend our skills for reading DataFrames from files.
2. Original CSV file

We'll use a comma-separated-values file of sunspot observations collected from SILSO (Sunspot Index & Long-term Solar Observations). The file has over seventy thousand rows, with entries dating back to the 19th century.

Source: SILSO, Daily total sunspot number (http://www.sidc.be/silso/infossntotdaily)

3. Datasets from CSV files

The read_csv function takes a file path (typically a string) as input. We read into a DataFrame sunspots. Using info, we see the DataFrame has mostly integer or floating-point entries. Notice the index of the DataFrame (the row labels) is of type RangeIndex (just integers).
4. Datasets from CSV files

Let's use the .iloc accessor to view a slice of the middle of the DataFrame. We can see some of the problems: the column headers don't make sense and there are many perplexing negative one entries in one column. What's going on?
5. Problems

First, the CSV file does not provide column labels in a header row. The column meanings can be gleaned from SILSO's website. Columns zero through two give the Gregorian date, column three is a decimal value of the date, column four is the number of sunspots observed that day, and column five indicates confidence in the measurement (zero or one). Second, the negative ones in column four denote missing values; we need to take care of those. Finally, as written, the dates are awkward for computation, a common problem with CSV files.
6. Using header keyword

Let's tidy this up. Passing header=None prevents pandas from assuming the first line of the file gives column labels. Alternatively, an integer header argument gives the row number (indexed from 0) that holds the column labels; the data begins on the row after it. Notice that the columns and rows are now labeled with integers starting from 0.
7. Using names keyword

We can explicitly label the columns with the option names. We define a list of strings col_names to label the columns properly.
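The header and names options can be sketched on a few in-memory rows. The three rows below are invented to resemble the layout described above and are not real SILSO observations. The label 'definite' for the confidence column is an assumed name; the other labels follow the column descriptions in this section.

```python
from io import StringIO

import pandas as pd

# Invented rows in the described layout: year, month, day, decimal date,
# sunspot count, confidence flag. There is no header row.
CSV_TEXT = (
    "1850,01,01,1850.001,12,1\n"
    "1850,01,02,1850.004,65,1\n"
    "1850,01,03,1850.007,40,1\n"
)

# header=None: do not treat the first line as column labels.
unlabeled = pd.read_csv(StringIO(CSV_TEXT), header=None)
print(unlabeled.columns)   # integer labels starting from 0

# names: supply explicit column labels ('definite' is an assumed name).
col_names = ["year", "month", "day", "dec_date", "sunspots", "definite"]
sunspots = pd.read_csv(StringIO(CSV_TEXT), header=None, names=col_names)
print(sunspots)
```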
8. Using na_values keyword (1)

We can also read the negative one entries in the sunspots column as NaN or Not-a-Number (sometimes called a null value). We do this with the na_values keyword. We try na_values='-1', but the sunspots column still has entries of negative one. Looking at the original CSV file reveals the problem: space characters precede the minus ones throughout column 4.
9. Using na_values keyword (2)

Thus, we use na_values=' -1' (with a leading space) and it works. Notice the sunspot numbers are now floating-point values rather than integers, because NaN is itself a floating-point value.
10. Using na_values keyword (3)

Several strings can represent invalid or missing values. To handle them, we pass na_values either a list of strings or a dictionary mapping column names to lists of strings. A dictionary makes it possible to use distinct patterns for null values in different columns; see the documentation for examples.
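A small sketch of the na_values forms discussed here, using invented rows in which the sunspot count is preceded by exactly one space. The column label 'definite' is an assumed name.

```python
from io import StringIO

import pandas as pd

# Invented rows; note the single space before each sunspot value.
CSV_TEXT = (
    "1850,01,01,1850.001, -1,1\n"
    "1850,01,02,1850.004, 65,1\n"
    "1850,01,03,1850.007, -1,1\n"
)
col_names = ["year", "month", "day", "dec_date", "sunspots", "definite"]

# A single string that matches the raw field, leading space included.
spots = pd.read_csv(
    StringIO(CSV_TEXT), header=None, names=col_names, na_values=" -1"
)
print(spots)

# A dictionary restricts the pattern to one column.
spots_by_col = pd.read_csv(
    StringIO(CSV_TEXT),
    header=None,
    names=col_names,
    na_values={"sunspots": [" -1"]},
)
print(spots_by_col["sunspots"])
```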
11. Using parse_dates keyword

Finally, we notice the year, month, and day columns can be loaded in a better way. The parse_dates keyword in read_csv infers dates intelligently. We use a list of lists of column positions (indexed from 0) to tell read_csv which columns hold the dates. Sure enough, there's a new column of datetimes named year_month_day that amalgamates the three original columns.
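The nested-list form of parse_dates described here appears to be deprecated in recent pandas releases, so the sketch below swaps in a different technique that yields a comparable year_month_day column: it reads the three columns as numbers and combines them with pd.to_datetime. The rows are invented, and 'definite' is an assumed column label.

```python
from io import StringIO

import pandas as pd

# Invented rows in the layout described above.
CSV_TEXT = (
    "1850,01,01,1850.001,12,1\n"
    "1850,01,02,1850.004,65,1\n"
    "1850,01,03,1850.007,40,1\n"
)
col_names = ["year", "month", "day", "dec_date", "sunspots", "definite"]

sunspots = pd.read_csv(StringIO(CSV_TEXT), header=None, names=col_names)

# pd.to_datetime assembles dates from columns named year, month, and day.
sunspots["year_month_day"] = pd.to_datetime(sunspots[["year", "month", "day"]])

print(sunspots["year_month_day"])
sunspots.info()
```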
12. Inspecting DataFrame

In fact, using the info method, we see the year_month_day column has entries of type datetime64. We'll learn more about datetimes when studying time series; they are invaluable for many time-based computations. Also, the sunspots column has about 69 thousand non-null entries.
13. Using dates as index

The DataFrame still lacks meaningful row labels in the Index. The year_month_day column can be assigned as the DataFrame index using the index attribute. Similarly, assigning 'date' to the index's name attribute gives a more concise label. Notice we still have the year_month_day and dec_date columns.
14. Trimming redundant columns

To get rid of them, we list the meaningful column names and extract them. The result is a more compact DataFrame with only the meaningful data.
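These last two steps might look like the sketch below. The DataFrame is built by hand from invented values so the example stands alone, and 'definite' is an assumed label for the confidence column.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the parsed sunspots DataFrame.
sunspots = pd.DataFrame(
    {
        "year_month_day": pd.to_datetime(
            ["1850-01-01", "1850-01-02", "1850-01-03"]
        ),
        "dec_date": [1850.001, 1850.004, 1850.007],
        "sunspots": [np.nan, 65.0, np.nan],
        "definite": [1, 1, 1],
    }
)

# Use the dates as row labels and give the index a shorter name.
sunspots.index = sunspots["year_month_day"]
sunspots.index.name = "date"

# Keep only the meaningful columns.
cols = ["sunspots", "definite"]
sunspots = sunspots[cols]
print(sunspots)
```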
15. Writing files

What if we want to share this new DataFrame with others? The sensible thing would be to export our compact DataFrame to a new CSV file. The method to_csv does the job for us. Like read_csv, the method to_csv has a host of options to fine-tune its behavior. We can even export to Excel using the to_excel method.
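A round trip through to_csv can be sketched as follows. The data and the file name sunspots.csv are invented, and the file is written to a temporary directory so nothing is left behind. The to_excel call is left out of the sketch because it needs an optional Excel writer package to be installed.

```python
import os
import tempfile

import pandas as pd

# Invented compact DataFrame with a date index.
sunspots = pd.DataFrame(
    {"sunspots": [10.0, 65.0, 42.0], "definite": [1, 1, 1]},
    index=pd.DatetimeIndex(
        ["1850-01-01", "1850-01-02", "1850-01-03"], name="date"
    ),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_csv = os.path.join(tmp_dir, "sunspots.csv")   # hypothetical file name
    sunspots.to_csv(out_csv)                          # the index is written too

    # Read the file back, restoring the date index.
    restored = pd.read_csv(out_csv, index_col="date", parse_dates=True)

print(restored)
```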
16. Let's practice!

Try some exercises now to practice loading and saving DataFrames. 

# Reading a flat file

In previous exercises, we have preloaded the data for you using the pandas function read_csv(). Now, it's your turn! Your job is to read the World Bank population data you saw earlier into a DataFrame using read_csv(). The file is available in the variable data_file.

The next step is to reread the same file, but simultaneously rename the columns using the names keyword input parameter, set equal to a list of new column labels. You will also need to set header=0 to rename the column labels.

Finish up by inspecting the result with df.head() and df.info() in the IPython Shell (changing df to the name of your DataFrame variable).

pandas has already been imported and is available in the workspace as pd.

Instructions

1. Use pd.read_csv() with the string data_file to read the CSV file into a DataFrame and assign it to df1.

2. Create a list of new column labels - 'year', 'population' - and assign it to the variable new_labels.

3. Reread the same file, again using pd.read_csv(), but this time, add the keyword arguments header=0 and names=new_labels. Assign the resulting DataFrame to df2.

4. Print both the df1 and df2 DataFrames to see the change in column names. This has already been done for you.


```python
# Read in the file: df1
df1 = pd.read_csv(data_file)

# Create a list of the new column labels: new_labels
new_labels = ['year', 'population']

# Read in the file, specifying the header and names parameters: df2
df2 = pd.read_csv(data_file, header=0, names=new_labels)

# Print both the DataFrames
print(df1)
print(df2)

'''
<script.py> output:
       Year  Total Population
    0  1960      3.034971e+09
    1  1970      3.684823e+09
    2  1980      4.436590e+09
    3  1990      5.282716e+09
    4  2000      6.115974e+09
    5  2010      6.924283e+09

---------------------------------

       year    population
    0  1960  3.034971e+09
    1  1970  3.684823e+09
    2  1980  4.436590e+09
    3  1990  5.282716e+09
    4  2000  6.115974e+09
    5  2010  6.924283e+09
'''
```

Conclusion

Well done! Knowing how to read in flat files using pandas is a vital skill.

# Delimiters, headers, and extensions

Not all data files are clean and tidy. Pandas provides methods for reading those not-so-perfect data files that you encounter far too often.

In this exercise, you have monthly stock data for four companies downloaded from [Yahoo Finance](http://finance.yahoo.com/). The data is stored as one row for each company and each column is the end-of-month closing price. The file name is given to you in the variable file_messy.

In addition, this file has three aspects that may cause trouble for lesser tools: multiple header lines, comment records (rows) interleaved throughout the data rows, and space delimiters instead of commas.

Your job is to use pandas to read the data from this problematic file_messy using non-default input options with read_csv() so as to tidy up the mess at read time. Then, write the cleaned up data to a CSV file with the variable file_clean that has been prepared for you, as you might do in a real data workflow.

You can learn about the option input parameters needed by using help() on the pandas function pd.read_csv().

Instructions

1. Use pd.read_csv() without using any keyword arguments to read file_messy into a pandas DataFrame df1.

2. Use .head() to print the first 5 rows of df1 and see how messy it is. Do this in the IPython Shell first so you can see how modifying read_csv() can clean up this mess.

3. Using the keyword arguments delimiter=' ', header=3 and comment='#', use pd.read_csv() again to read file_messy into a new DataFrame df2.

4. Print the output of df2.head() to verify the file was read correctly.

5. Use the DataFrame method .to_csv() to save the DataFrame df2 to the file named by the variable file_clean. Be sure to specify index=False.

6. Use the DataFrame method .to_excel() to save the DataFrame df2 to the file 'file_clean.xlsx'. Again, remember to specify index=False.
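The read steps in these instructions can be sketched on an in-memory stand-in for file_messy. The text below is invented (company names and prices are placeholders) but has the same three problems: extra header lines, comment rows, and space delimiters. The sketch writes CSV to a string rather than to file_clean so that it stands alone, and it skips to_excel because that needs an optional Excel writer package.

```python
from io import StringIO

import pandas as pd

# Invented stand-in for the messy file: three junk lines, a header,
# comment rows starting with '#', and space-delimited values.
MESSY_TEXT = (
    "made-up stock data\n"
    "not real prices\n"
    "header follows\n"
    "name Jan Feb Mar\n"
    "# comment between rows\n"
    "AAA 10.5 11.0 11.5\n"
    "# another comment\n"
    "BBB 20.0 19.5 21.25\n"
)

# Default options: every line lands in a single messy column.
df1 = pd.read_csv(StringIO(MESSY_TEXT))
print(df1.head())

# Non-default options tidy the data at read time.
df2 = pd.read_csv(StringIO(MESSY_TEXT), delimiter=" ", header=3, comment="#")
print(df2.head())

# With no path argument, to_csv returns the CSV text as a string.
clean_csv = df2.to_csv(index=False)
print(clean_csv)
```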


In [None]:
In [2]:
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file into DataFrame.
    
    Also supports optionally iterating or breaking of the file
    into chunks.
    
    Additional help can be found in the online docs for
    `IO Tools <http://pandas.pydata.org/pandas-docs/stable/io.html>`_.
    
    Parameters
    ----------
    filepath_or_buffer : str, path object, or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be: file://localhost/path/to/table.csv.
    
        If you want to pass in a path object, pandas accepts either
        ``pathlib.Path`` or ``py._path.local.LocalPath``.
    
        By file-like object, we refer to objects with a ``read()`` method, such as
        a file handler (e.g. via builtin ``open`` function) or ``StringIO``.
    sep : str, default ','
        Delimiter to use. If sep is None, the C engine cannot automatically detect
        the separator, but the Python parsing engine can, meaning the latter will
        be used and automatically detect the separator by Python's builtin sniffer
        tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
        different from ``'\s+'`` will be interpreted as regular expressions and
        will also force the use of the Python parsing engine. Note that regex
        delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``.
    delimiter : str, default ``None``
        Alias for sep.
    header : int, list of int, default 'infer'
        Row number(s) to use as the column names, and the start of the
        data.  Default behavior is to infer the column names: if no names
        are passed the behavior is identical to ``header=0`` and column
        names are inferred from the first line of the file, if column
        names are passed explicitly then the behavior is identical to
        ``header=None``. Explicitly pass ``header=0`` to be able to
        replace existing names. The header can be a list of integers that
        specify row locations for a multi-index on the columns
        e.g. [0,1,3]. Intervening rows that are not specified will be
        skipped (e.g. 2 in this example is skipped). Note that this
        parameter ignores commented lines and empty lines if
        ``skip_blank_lines=True``, so ``header=0`` denotes the first line of
        data rather than the first line of the file.
    names : array-like, optional
        List of column names to use. If file contains no header row, then you
        should explicitly pass ``header=None``. Duplicates in this list will cause
        a ``UserWarning`` to be issued.
    index_col : int, sequence or bool, optional
        Column to use as the row labels of the DataFrame. If a sequence is given, a
        MultiIndex is used. If you have a malformed file with delimiters at the end
        of each line, you might consider ``index_col=False`` to force pandas to
        not use the first column as the index (row names).
    usecols : list-like or callable, optional
        Return a subset of the columns. If list-like, all elements must either
        be positional (i.e. integer indices into the document columns) or strings
        that correspond to column names provided either by the user in `names` or
        inferred from the document header row(s). For example, a valid list-like
        `usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
        Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
        To instantiate a DataFrame from ``data`` with element order preserved use
        ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
        in ``['foo', 'bar']`` order or
        ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
        for ``['bar', 'foo']`` order.
    
        If callable, the callable function will be evaluated against the column
        names, returning names where the callable function evaluates to True. An
        example of a valid callable argument would be ``lambda x: x.upper() in
        ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
        parsing time and lower memory usage.
    squeeze : bool, default False
        If the parsed data only contains one column then return a Series.
    prefix : str, optional
        Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
    mangle_dupe_cols : bool, default True
        Duplicate columns will be specified as 'X', 'X.1', ...'X.N', rather than
        'X'...'X'. Passing in False will cause data to be overwritten if there
        are duplicate names in the columns.
    dtype : Type name or dict of column -> type, optional
        Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32,
        'c': 'Int64'}
        Use `str` or `object` together with suitable `na_values` settings
        to preserve and not interpret dtype.
        If converters are specified, they will be applied INSTEAD
        of dtype conversion.
    engine : {'c', 'python'}, optional
        Parser engine to use. The C engine is faster while the python engine is
        currently more feature-complete.
    converters : dict, optional
        Dict of functions for converting values in certain columns. Keys can either
        be integers or column labels.
    true_values : list, optional
        Values to consider as True.
    false_values : list, optional
        Values to consider as False.
    skipinitialspace : bool, default False
        Skip spaces after delimiter.
    skiprows : list-like, int or callable, optional
        Line numbers to skip (0-indexed) or number of lines to skip (int)
        at the start of the file.
    
        If callable, the callable function will be evaluated against the row
        indices, returning True if the row should be skipped and False otherwise.
        An example of a valid callable argument would be ``lambda x: x in [0, 2]``.
    skipfooter : int, default 0
        Number of lines at bottom of file to skip (Unsupported with engine='c').
    nrows : int, optional
        Number of rows of file to read. Useful for reading pieces of large files.
    na_values : scalar, str, list-like, or dict, optional
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values.  By default the following values are interpreted as
        NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
        '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan',
        'null'.
    keep_default_na : bool, default True
        Whether or not to include the default NaN values when parsing the data.
        Depending on whether `na_values` is passed in, the behavior is as follows:
    
        * If `keep_default_na` is True, and `na_values` are specified, `na_values`
          is appended to the default NaN values used for parsing.
        * If `keep_default_na` is True, and `na_values` are not specified, only
          the default NaN values are used for parsing.
        * If `keep_default_na` is False, and `na_values` are specified, only
          the NaN values specified `na_values` are used for parsing.
        * If `keep_default_na` is False, and `na_values` are not specified, no
          strings will be parsed as NaN.
    
        Note that if `na_filter` is passed in as False, the `keep_default_na` and
        `na_values` parameters will be ignored.
    na_filter : bool, default True
        Detect missing value markers (empty strings and the value of na_values). In
        data without any NAs, passing na_filter=False can improve the performance
        of reading a large file.
    verbose : bool, default False
        Indicate number of NA values placed in non-numeric columns.
    skip_blank_lines : bool, default True
        If True, skip over blank lines rather than interpreting as NaN values.
    parse_dates : bool or list of int or names or list of lists or dict, default False
        The behavior is as follows:
    
        * boolean. If True -> try parsing the index.
        * list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
          each as a separate date column.
        * list of lists. e.g.  If [[1, 3]] -> combine columns 1 and 3 and parse as
          a single date column.
        * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call
          result 'foo'
    
        If a column or index cannot be represented as an array of datetimes,
        say because of an unparseable value or a mixture of timezones, the column
        or index will be returned unaltered as an object data type. For
        non-standard datetime parsing, use ``pd.to_datetime`` after
        ``pd.read_csv``. To parse an index or column with a mixture of timezones,
        specify ``date_parser`` to be a partially-applied
        :func:`pandas.to_datetime` with ``utc=True``. See
        :ref:`io.csv.mixed_timezones` for more.
    
        Note: A fast-path exists for iso8601-formatted dates.
    infer_datetime_format : bool, default False
        If True and `parse_dates` is enabled, pandas will attempt to infer the
        format of the datetime strings in the columns, and if it can be inferred,
        switch to a faster method of parsing them. In some cases this can increase
        the parsing speed by 5-10x.
    keep_date_col : bool, default False
        If True and `parse_dates` specifies combining multiple columns then
        keep the original columns.
    date_parser : function, optional
        Function to use for converting a sequence of string columns to an array of
        datetime instances. The default uses ``dateutil.parser.parser`` to do the
        conversion. Pandas will try to call `date_parser` in three different ways,
        advancing to the next if an exception occurs: 1) Pass one or more arrays
        (as defined by `parse_dates`) as arguments; 2) concatenate (row-wise) the
        string values from the columns defined by `parse_dates` into a single array
        and pass that; and 3) call `date_parser` once for each row using one or
        more strings (corresponding to the columns defined by `parse_dates`) as
        arguments.
    dayfirst : bool, default False
        DD/MM format dates, international and European format.
    iterator : bool, default False
        Return TextFileReader object for iteration or getting chunks with
        ``get_chunk()``.
    chunksize : int, optional
        Return TextFileReader object for iteration.
        See the `IO Tools docs
        <http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_
        for more information on ``iterator`` and ``chunksize``.
    compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
        For on-the-fly decompression of on-disk data. If 'infer' and
        `filepath_or_buffer` is path-like, then detect compression from the
        following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
        decompression). If using 'zip', the ZIP file must contain only one data
        file to be read in. Set to None for no decompression.
    
        .. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
    
    thousands : str, optional
        Thousands separator.
    decimal : str, default '.'
        Character to recognize as decimal point (e.g. use ',' for European data).
    lineterminator : str (length 1), optional
        Character to break file into lines. Only valid with C parser.
    quotechar : str (length 1), optional
        The character used to denote the start and end of a quoted item. Quoted
        items can include the delimiter and it will be ignored.
    quoting : int or csv.QUOTE_* instance, default 0
        Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
        QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
    doublequote : bool, default ``True``
       When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
       whether or not to interpret two consecutive quotechar elements INSIDE a
       field as a single ``quotechar`` element.
    escapechar : str (length 1), optional
        One-character string used to escape other characters.
    comment : str, optional
        Indicates remainder of line should not be parsed. If found at the beginning
        of a line, the line will be ignored altogether. This parameter must be a
        single character. Like empty lines (as long as ``skip_blank_lines=True``),
        fully commented lines are ignored by the parameter `header` but not by
        `skiprows`. For example, if ``comment='#'``, parsing
        ``#empty\na,b,c\n1,2,3`` with ``header=0`` will result in 'a,b,c' being
        treated as the header.
    encoding : str, optional
        Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
        standard encodings
        <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ .
    dialect : str or csv.Dialect, optional
        If provided, this parameter will override values (default or not) for the
        following parameters: `delimiter`, `doublequote`, `escapechar`,
        `skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to
        override values, a ParserWarning will be issued. See csv.Dialect
        documentation for more details.
    tupleize_cols : bool, default False
        Leave a list of tuples on columns as is (default is to convert to
        a MultiIndex on the columns).
    
        .. deprecated:: 0.21.0
           This argument will be removed and will always convert to MultiIndex
    
    error_bad_lines : bool, default True
        Lines with too many fields (e.g. a csv line with too many commas) will by
        default cause an exception to be raised, and no DataFrame will be returned.
        If False, then these "bad lines" will dropped from the DataFrame that is
        returned.
    warn_bad_lines : bool, default True
        If error_bad_lines is False, and warn_bad_lines is True, a warning for each
        "bad line" will be output.
    delim_whitespace : bool, default False
        Specifies whether or not whitespace (e.g. ``' '`` or ``'    '``) will be
        used as the sep. Equivalent to setting ``sep='\s+'``. If this option
        is set to True, nothing should be passed in for the ``delimiter``
        parameter.
    
        .. versionadded:: 0.18.1 support for the Python parser.
    
    low_memory : bool, default True
        Internally process the file in chunks, resulting in lower memory use
        while parsing, but possibly mixed type inference.  To ensure no mixed
        types either set False, or specify the type with the `dtype` parameter.
        Note that the entire file is read into a single DataFrame regardless,
        use the `chunksize` or `iterator` parameter to return the data in chunks.
        (Only valid with C parser).
    memory_map : bool, default False
        If a filepath is provided for `filepath_or_buffer`, map the file object
        directly onto memory and access the data directly from there. Using this
        option can improve performance because there is no longer any I/O overhead.
    float_precision : str, optional
        Specifies which converter the C engine should use for floating-point
        values. The options are `None` for the ordinary converter,
        `high` for the high-precision converter, and `round_trip` for the
        round-trip converter.
    
    Returns
    -------
    DataFrame or TextParser
        A comma-separated values (csv) file is returned as two-dimensional
        data structure with labeled axes.
    
    See Also
    --------
    to_csv : Write DataFrame to a comma-separated values (csv) file.
    read_csv : Read a comma-separated values (csv) file into DataFrame.
    read_fwf : Read a table of fixed-width formatted lines into DataFrame.
    
    Examples
    --------
    >>> pd.read_csv('data.csv')  # doctest: +SKIP

# Read the raw file as-is: df1
df1 = pd.read_csv(file_messy)

# Print the output of df1.head()
print(df1.head())

# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#')

# Print the output of df2.head()
print(df2.head())

# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)

# Save the cleaned up DataFrame to an excel file without the index
df2.to_excel('file_clean.xlsx', index=False)

'''
<script.py> output:
                                                       The following stock data was collect on 2016-AUG-25 from an unknown source
    These kind of comments are not very useful                                                  are they?                        
    Probably should just throw this line away too          but not the next since those are column labels                        
    name Jan Feb Mar Apr May Jun Jul Aug Sep Oct No...                                                NaN                        
    # So that line you just read has all the column...                                                NaN                        
    IBM 156.08 160.01 159.81 165.22 172.25 167.15 1...                                                NaN                        

-------------------------------------------------------------------------------------------------

         name     Jan     Feb     Mar     Apr  ...     Aug     Sep     Oct     Nov     Dec
    0     IBM  156.08  160.01  159.81  165.22  ...  152.77  145.36  146.11  137.21  137.96
    1    MSFT   45.51   43.08   42.13   43.47  ...   45.51   43.56   48.70   53.88   55.40
    2  GOOGLE  512.42  537.99  559.72  540.50  ...  636.84  617.93  663.59  735.39  755.35
    3   APPLE  110.64  125.43  125.97  127.29  ...  113.39  112.80  113.36  118.16  111.73
    
    [4 rows x 13 columns]
'''

Conclusion

Superb! It's important to be able to save your cleaned DataFrames in the desired file format!
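
For readers without the exercise files, the cleanup above can be tried on an in-memory stand-in. This is only a minimal sketch: the text in `messy_text` is made up to imitate the layout of the messy file (three junk lines, a header row, a commented line, space-separated values) and is not the actual exercise data. The names `messy_text`, `buffer`, and `clean_text` are illustrative.

```python
import io
import pandas as pd

# Hypothetical stand-in for the messy file: three junk lines, the header
# on line 3 (0-indexed), one '#' comment line, and space-separated values.
messy_text = """Stock data, unknown source
Not very useful
Throw this away
name Jan Feb Mar
# the line above holds the column labels
IBM 156.08 160.01 159.81
MSFT 45.51 43.08 42.13
"""

# Same options as in the exercise: space delimiter, header on line 3,
# and '#' marking comments.
df = pd.read_csv(io.StringIO(messy_text), delimiter=' ', header=3, comment='#')
print(df)

# Write the cleaned table back out as CSV text without the index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
clean_text = buffer.getvalue()
print(clean_text)
```

Writing to a `StringIO` buffer instead of a file keeps the sketch free of side effects; passing a filename to `to_csv` works the same way.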

# Plotting with pandas

1. Plotting with pandas

Data visualization is a primary tool in a working data scientist's toolbox; let's see how to do it with pandas.
2. AAPL stock data

For convenience, we import pandas as pd and matplotlib dot pyplot as plt. We load the AAPL stock data into a DataFrame using read_csv. Notice the options parse_dates equals True and index_col equals 'date' to force a datetime64 index. Again, we'll use these a lot with time series shortly. Also observe that entries in the volume column are significantly larger in magnitude than those in the other columns.
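
The loading step described above might look roughly like the following. This is a sketch under assumptions: the three rows in `csv_text` are made-up placeholders rather than real AAPL quotes, and the column names are chosen only to match the lowercase names used in this section.

```python
import io
import pandas as pd

# Made-up rows standing in for the real AAPL file (illustrative values only).
csv_text = """date,open,close,volume
2001-01-02,1.06,1.06,113078000
2001-01-03,1.04,1.17,204268400
2001-01-04,1.16,1.22,184849000
"""

# index_col='date' makes the date column the row index;
# parse_dates=True asks pandas to parse that index as datetimes.
aapl = pd.read_csv(io.StringIO(csv_text), index_col='date', parse_dates=True)

print(type(aapl.index))
print(aapl.dtypes)

# The volume entries are several orders of magnitude larger than the prices.
print(aapl['volume'].min() / aapl['close'].max())
```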
3. Plotting arrays (matplotlib)

Now, we assign close_arr by indexing aapl 'close' (yielding a Series) and applying the values attribute (yielding a NumPy array). Remember, the command plot can plot NumPy arrays or lists and the command show must be executed to make the plot visible.
4. Plotting arrays (matplotlib)

This is the resulting plot of stock close prices. Notice the horizontal axis of the plot corresponds to the integer positions of the array elements, since a NumPy array carries no date labels.
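
A minimal sketch of plotting the array follows. The tiny `aapl` frame here is synthetic (made-up prices on five consecutive days) and only stands in for the real data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the AAPL data (made-up values).
dates = pd.date_range('2001-01-01', periods=5, freq='D', name='date')
aapl = pd.DataFrame({'close': [1.06, 1.17, 1.22, 1.17, 1.10]}, index=dates)

# .values drops the index and returns a plain NumPy array.
close_arr = aapl['close'].values
print(type(close_arr))

# With only y values given, matplotlib uses positions 0, 1, 2, ... for x.
lines = plt.plot(close_arr)
print(lines[0].get_xdata())
plt.show()
```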
5. Plotting Series (matplotlib)

We can actually plot pandas Series directly. We assign close_series from aapl as a Series and call plot with close_series as an argument.
6. Plotting Series (matplotlib)

The result is a similar plot but a bit nicer. The plot function automatically uses the Series's datetime index labels along the horizontal axis.
7. Plotting Series (pandas)

An even nicer alternative is to use the pandas Series plot method; that is, apply close_series dot plot.
8. Plotting Series (pandas)

The result is as before but with even more formatting on the axis labels and the name of the axis (date) inferred from the Index name.
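
The two ways of plotting a Series described above can be compared side by side. This is a sketch on synthetic data; `close_series` holds made-up prices, and each plot is drawn in its own figure so that they do not overlap.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic closing prices with a named datetime index (made-up values).
dates = pd.date_range('2001-01-01', periods=5, freq='D', name='date')
close_series = pd.Series([1.06, 1.17, 1.22, 1.17, 1.10], index=dates, name='close')

# matplotlib: passing the Series lets plot() take x values from its index.
plt.figure()
mpl_lines = plt.plot(close_series)

# pandas: the Series plot method returns the matplotlib Axes it drew on.
plt.figure()
ax = close_series.plot()
print(repr(ax.get_xlabel()))

plt.show()
```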
9. Plotting DataFrames (pandas)

In fact, pandas DataFrames have a plot method just like pandas Series. Calling aapl dot plot plots all of the columns of DataFrame aapl on the same axes.
10. Plotting DataFrames (pandas)

Pandas plots each numerical column against the index and uses the column labels in the legend. However, on this scale, we can't see the five price curves because the volume column is so much larger than all the others.
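
As a rough illustration of the DataFrame plot method, the sketch below uses a small synthetic frame with three columns instead of the real AAPL data; the numbers are made up, with volume deliberately far larger than the prices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the AAPL data (made-up values).
dates = pd.date_range('2001-01-01', periods=5, freq='D', name='date')
aapl = pd.DataFrame(
    {
        'open': [1.06, 1.04, 1.16, 1.21, 1.18],
        'close': [1.06, 1.17, 1.22, 1.17, 1.10],
        'volume': [1.1e8, 2.0e8, 1.8e8, 9.3e7, 1.5e8],
    },
    index=dates,
)

# One line per numeric column, all drawn on the same Axes.
ax = aapl.plot()

# The legend entries are taken from the column labels.
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
plt.show()
```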
11. Plotting DataFrames (matplotlib)

We can produce a similar plot using plt dot plot from matplotlib (using the DataFrame as an argument).
12. Plotting DataFrames (matplotlib)

This implicitly draws all the numeric columns of aapl against the Index. The figure resembles the one plotted using the DataFrame method plot, but there is no legend and no title on the date axis. Again, the volume column dominates the other five curves and they cannot be seen on this scale.
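
For comparison, here is a sketch of passing a DataFrame straight to matplotlib. The frame is synthetic (made-up values), and the variable `has_legend` is only there to show that matplotlib does not add a legend on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the AAPL data (made-up values).
dates = pd.date_range('2001-01-01', periods=5, freq='D', name='date')
aapl = pd.DataFrame(
    {
        'open': [1.06, 1.04, 1.16, 1.21, 1.18],
        'close': [1.06, 1.17, 1.22, 1.17, 1.10],
        'volume': [1.1e8, 2.0e8, 1.8e8, 9.3e7, 1.5e8],
    },
    index=dates,
)

# matplotlib draws one line per column against the index.
lines = plt.plot(aapl)
ax = plt.gca()

# Unlike DataFrame.plot(), no legend is created automatically.
has_legend = ax.get_legend() is not None
print(len(lines), has_legend)
plt.show()
```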
13. Fixing scales

To remedy that problem, draw the plot again and call yscale('log'). This matplotlib function sets a logarithmic scale on the vertical axis.
14. Fixing scales

The legend still appears automatically, but now we can distinguish volumes on the order of 10^7 from price values on the order of 10^2. Any matplotlib options can be used to customize a Series or DataFrame figure.
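
The scale fix can be sketched as follows, again on synthetic data with made-up prices near 10^2 and volumes near 10^7.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data: prices near 10^2, volumes near 10^7 (made-up values).
dates = pd.date_range('2001-01-01', periods=5, freq='D', name='date')
aapl = pd.DataFrame(
    {
        'open': [101.0, 104.5, 99.8, 102.3, 105.1],
        'close': [103.2, 100.4, 101.9, 104.8, 106.0],
        'volume': [1.2e7, 3.4e7, 2.1e7, 1.8e7, 2.9e7],
    },
    index=dates,
)

# Draw with the DataFrame plot method, then switch the current Axes
# to a logarithmic vertical scale.
ax = aapl.plot()
plt.yscale('log')
print(ax.get_yscale())
plt.show()
```

On a log scale, equal vertical distances correspond to equal ratios, which is why curves five orders of magnitude apart fit in one figure.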
15. Customizing plots

For instance, we can extract the open and close Series and plot them separately specifying the colors, line styles, and the legend labels. We zoom the axis in to the year 2001 with vertical scale from 0 to 100 and we explicitly place a legend. To find out more about matplotlib customization, see our course on Data Visualization in Python.
16. Customizing plots

Notice, again, the horizontal date ticks are labelled for us cleanly.
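
One possible way to reproduce this kind of customization is sketched below. It is an assumption-laden stand-in rather than the course's own code: it draws with matplotlib Axes methods instead of the Series plot method, and the monthly prices are synthetic.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly prices for 2000-2002 (made-up numbers, not real AAPL data).
dates = pd.date_range('2000-01-01', periods=36, freq='MS')
aapl = pd.DataFrame(
    {'open': np.linspace(20, 60, 36), 'close': np.linspace(22, 58, 36)},
    index=dates,
)

fig, ax = plt.subplots()

# Blue dots joined by a line for open, red dots only for close.
ax.plot(aapl['open'], color='b', linestyle='-', marker='.', label='open')
ax.plot(aapl['close'], color='r', linestyle='none', marker='.', label='close')

# Zoom in on the year 2001 and fix the vertical range to 0-100.
ax.set_xlim(pd.Timestamp('2001-01-01'), pd.Timestamp('2001-12-31'))
ax.set_ylim(0, 100)

# Place the legend explicitly.
ax.legend(loc='upper left')
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
plt.show()
```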
17. Saving plots

Finally, having drawn a figure, it's useful to be able to save it for future use.
18. Saving plots

To obtain the preceding plot, we slice four columns and the rows corresponding to 2001 through 2004 inclusive from the aapl DataFrame (we'll learn more about time series slicing later). We generate a plot and apply savefig to preserve the plot. savefig can infer the file format - for instance, PNG, JPG, PDF, and others - from the suffix of the filename.
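
The slicing and saving steps might be sketched like this. The data are synthetic, the choice of the four columns is an assumption, and the files are written to a temporary directory (`out_dir`) purely to keep the example tidy.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly data for 2000-2005 (made-up numbers, not real AAPL data).
dates = pd.date_range('2000-01-01', '2005-12-01', freq='MS')
n = len(dates)
aapl = pd.DataFrame(
    {
        'open': np.linspace(10, 40, n),
        'high': np.linspace(12, 44, n),
        'low': np.linspace(9, 38, n),
        'close': np.linspace(11, 41, n),
        'volume': np.linspace(1e7, 5e7, n),
    },
    index=dates,
)

# Label-based slice: rows from 2001 through 2004 (inclusive), four columns.
subset = aapl.loc['2001':'2004', ['open', 'close', 'high', 'low']]
ax = subset.plot(title='AAPL (synthetic data)')

# savefig picks the output format from the filename suffix.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'aapl.png')
pdf_path = os.path.join(out_dir, 'aapl.pdf')
plt.savefig(png_path)
plt.savefig(pdf_path)
print(png_path, pdf_path)
```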
19. Let's practice!

Now it's your turn to make some fancy plots using pandas in the exercises! 