# `pandas` - Reading DataFrames

__Contents__: This notebook describes the process of reading a DataFrame from a CSV file. This entails:
1. Finding the CSV file and any companion files
1. Visually checking their contents, especially the CSV file
1. Reading the CSV file into a DataFrame
1. Checking the datatypes of the DataFrame columns
1. Possibly rereading the CSV file with different parameters and rechecking the datatypes

Related/useful documentation
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Load libraries.

In [5]:
import pandas as pd
import numpy as np
(pd.__version__, np.__version__)

## Finding the Files

In [7]:
%sh ls /dbfs/mnt/datalab-datasets/file-samples/import*

### Checking File Content

In [9]:
%sh cat /dbfs/mnt/datalab-datasets/file-samples/imports-85.names

In [10]:
%sh head /dbfs/mnt/datalab-datasets/file-samples/imports-85.csv

The CSV file does not have a header row, so the column names are taken from the documentation (the `imports-85.names` file).

In [12]:
column_names = ['symboling', 'normalized-losses', 'make', 'fuel-type',
                'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
                'engine-location', 'wheel-base', 'length', 'width',
                'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
                'engine-size', 'fuel-system', 'bore', 'stroke',
                'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
                'highway-mpg', 'price']

## Read CSV File

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

In [15]:
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                       )
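
As a minimal sketch of what the `names` parameter does, the example below reads a small in-memory CSV that has no header row. The three-column excerpt and the names `sample_csv`, `sample_names` and `sample_df` are assumptions made for illustration; they are not part of the real file.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt standing in for a headerless CSV file.
sample_csv = "3,?,alfa-romero\n1,164,audi\n"
sample_names = ['symboling', 'normalized-losses', 'make']

# Passing `names` tells pandas that the file has no header row,
# so the first line is read as data rather than as column labels.
sample_df = pd.read_csv(io.StringIO(sample_csv),
                        names=[name.replace('-', '_') for name in sample_names])

print(sample_df.columns.tolist())
print(len(sample_df))
```

Replacing the hyphens with underscores keeps the column names valid Python identifiers, so a column can also be reached with attribute access such as `sample_df.normalized_losses`.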

## Check Datatypes

Look for numeric columns that were read in as `object` (character) types.

In [18]:
import_df.select_dtypes(include=['object']).head()

Notice `normalized_losses` in the listing above. It contains numeric data.

Although the `bore`, `stroke`, `horsepower`, `peak_rpm` and `price` columns were also read as `object` type, the fix for `normalized_losses` may fix these columns too. (It will.)

Check the values of `normalized_losses`.

In [21]:
import_df['normalized_losses'].unique()

All of the values are numeric except for `?`. Reread the CSV file with the `na_values` parameter so that `?` is treated as an NA value.

In [23]:
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )

Check the type of `normalized_losses`.

In [25]:
import_df['normalized_losses'].dtype

It is now `float64`.
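
The effect of `na_values` can be reproduced on a small in-memory example. This is only a sketch: the three-row sample and the names `sample_csv`, `raw_df` and `clean_df` are assumptions for illustration, not the real data set.

```python
import io
import pandas as pd

# Hypothetical excerpt in which '?' marks a missing value.
sample_csv = "3,?,alfa-romero\n1,164,audi\n2,?,bmw\n"
names = ['symboling', 'normalized_losses', 'make']

# Without na_values the '?' forces the whole column to a non-numeric type.
raw_df = pd.read_csv(io.StringIO(sample_csv), names=names)

# With na_values=['?'] the marker becomes NaN and the column parses as numbers.
clean_df = pd.read_csv(io.StringIO(sample_csv), names=names, na_values=['?'])

print(raw_df['normalized_losses'].dtype)
print(clean_df['normalized_losses'].dtype)
print(clean_df['normalized_losses'].isnull().sum())
```

The column becomes `float64` rather than `int64` because `NaN` is a floating-point value, so an integer column that contains missing values is promoted to float.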

Recheck the columns of type `object`.

In [28]:
import_df.select_dtypes(include=['object']).head(5)

In [29]:
import_df.select_dtypes(include=['object']).head(5).transpose()

Notice that `normalized_losses` and the other numeric columns are no longer in the list. All of the remaining columns are correctly typed.

Check the types of all of the columns. Notice that in addition to `object` we have only `int64` and `float64`.

In [32]:
import_df.dtypes

Check the first five values of each of the `int64` and `float64` variables.

In [34]:
import_df.select_dtypes(include=['int64','float64']).head()

This looks good. 
- The variables/columns are typed correctly. 
- The `?` in the CSV file has been coded as NA, specifically as NumPy `NaN`.

The following command reads the data set correctly and will be used in subsequent notebooks.

In [36]:
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )

### DataFrame Summaries
The remainder of the notebook consists of a few methods that provide basic summaries of DataFrames.

The `columns` attribute lists the column names.

In [39]:
import_df.columns

The output above can be converted into a list of strings with the `list` function.
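
A minimal sketch of that conversion, using a small hypothetical DataFrame (`small_df` and its values are assumptions for illustration) instead of `import_df`:

```python
import pandas as pd

# Hypothetical two-column DataFrame used only for illustration.
small_df = pd.DataFrame({'make': ['audi'], 'price': [13950.0]})

# `columns` is a pandas Index; `list` turns it into a plain list of strings.
column_list = list(small_df.columns)
print(column_list)
```

`small_df.columns.tolist()` gives the same result.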

The `dtypes` attribute lists the column names and their types.

In [42]:
import_df.dtypes

The `info` method lists the name, number of non-null values and dtype of each column.

In [44]:
import_df.info()

The following command lists the number of null values in each column that has at least one null value.

In [46]:
null_counts = import_df.isnull().sum().sort_values(ascending=False)[lambda counts: counts > 0]
print(type(null_counts))
null_counts

The cells below build up the same expression one step at a time.

In [47]:
import_df.isnull()

In [48]:
import_df.isnull().sum()

In [49]:
import_df.isnull().sum().sort_values(ascending=False)

In [50]:
import_df.isnull().sum().sort_values(ascending=False)[lambda x: x > 0]
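
The chain of calls above can be followed step by step on a small example. This is a sketch on a hypothetical DataFrame; `small_df`, its values and the intermediate variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values in two of its three columns.
small_df = pd.DataFrame({
    'make': ['audi', 'bmw', 'dodge'],
    'price': [13950.0, np.nan, np.nan],
    'bore': [3.19, np.nan, 2.97],
})

is_null = small_df.isnull()                         # boolean DataFrame, True where a value is missing
null_counts = is_null.sum()                         # Series: missing values per column
ranked = null_counts.sort_values(ascending=False)   # largest counts first
only_missing = ranked[lambda counts: counts > 0]    # keep columns with at least one missing value

print(only_missing)
```

The callable inside the square brackets receives the Series being indexed, which lets the filter be written at the end of the chain without an intermediate variable.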

__The End__