# Python Foundations of Data Science

In this class, we will learn to work with data in Python. We will learn to import and manipulate some files. Go ahead and hit `Shift + Enter` to run each cell.

## Pandas

Pandas is the workhorse of data science in Python. It can read most common tabular file formats and provides the tools for working with the data once it is loaded. You will need to import it before use; by convention, it is aliased as `pd`.

In [None]:
import pandas as pd

Having imported Pandas, we can load our files. We will start with a CSV, a plain-text file of comma-separated values. CSV is a popular format for data interchange because it does not depend on which database or storage format each party uses. The function for reading CSV files is `pd.read_csv()`.

In [None]:
file_df = pd.read_csv('./assets/Boston/train.csv')

Pandas will read in the file and return a DataFrame, which we can manipulate. We can look at its first few rows with the `.head()` method.

In [None]:
file_df.head()

`.head()` takes an optional parameter, `n`, that sets how many rows to show (the default is 5). The file has an ID column, which we might want to use as the index of our records; `read_csv()` accepts an `index_col` argument for this. Let's reload the file with that index and show 7 records.

In [None]:
file_df = pd.read_csv('./assets/Boston/train.csv', index_col='ID')
file_df.head(n=7)

We might want to get a list of what column names and types we have loaded into our DataFrame. We can do that with the `.info()` method.

In [None]:
file_df.info()

The method shows the type of each column, as well as how many values in that column are not null. This is important to know because many calculations and models either skip missing values or fail on them, so nulls usually need to be handled before further work.
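As a rough illustration of checking for nulls, the sketch below counts missing values per column and then drops incomplete rows. The `toy_df` data and its column names are made up for this example and are not part of the Boston file; dropping rows is only one of several ways to handle nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Boston file) with two missing entries.
toy_df = pd.DataFrame({
    'rooms': [6.5, np.nan, 7.1],
    'price': [24.0, 21.6, np.nan],
})

# Count the missing values in each column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# One simple option: drop every row that contains a missing value.
clean_df = toy_df.dropna()
print(clean_df)
```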

When dealing with numeric values, it is good to get some summary statistics of the columns as a sanity check. We can do this with `.describe()`, which gives you the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles for each numeric column.

In [None]:
file_df.describe()

Not every file is comma separated. Some files are separated by tabs, pipes (`|`), or some other delimiter. `.read_csv()` lets you specify the separator with the `sep` argument when you make the call. You can also specify whether the file has column headers in the first row, whether there is an index column, what to do with null values, and a lot more. The documentation is available here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
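Here is a minimal sketch of reading a pipe-delimited file. To keep it self-contained, the "file" is an in-memory string wrapped in `io.StringIO`; the text and its column names are invented for illustration. With a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited content standing in for a file on disk.
pipe_text = "id|name|score\n1|Ann|3.5\n2|Bob|4.0\n"

# sep sets the delimiter; index_col picks the column to use as the index.
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep='|', index_col='id')
print(pipe_df)
```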

You can read in JSON files using `.read_json()` and Excel files with `.read_excel()`.
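The sketch below shows `.read_json()` on a small in-memory JSON string; the records are made up for this example. `.read_excel()` is called in a similar way but is not shown here, since it typically needs an extra Excel engine package to be installed.

```python
import io

import pandas as pd

# Hypothetical JSON content: a list of records, one per row.
json_text = '[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'

# orient='records' tells Pandas that each list element is one row.
json_df = pd.read_json(io.StringIO(json_text), orient='records')
print(json_df)
```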

### Saving Files
After working on a DataFrame, you may need to save it for later use. You can use `.to_csv()` to save it in CSV format. It supports a number of options, such as whether to write the index, so it is worth reading the documentation.

In [None]:
file_df.to_csv('./output.csv')
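As an example of one such option, the sketch below uses `index=False` to leave the index out of the output. It relies on the fact that `.to_csv()` returns the CSV text as a string when no path is given, which makes the result easy to inspect; `small_df` is a hypothetical DataFrame for illustration.

```python
import pandas as pd

# Hypothetical DataFrame used only for this example.
small_df = pd.DataFrame({'col_a': [1, 2], 'col_b': ['x', 'y']})

# With no path, to_csv() returns the CSV text instead of writing a file.
with_index = small_df.to_csv()
without_index = small_df.to_csv(index=False)

print(with_index)
print(without_index)
```

Leaving the index out is often useful when the index is just the default row numbers and carries no information of its own.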

### Creating DataFrames

You can create a DataFrame from a dictionary. Each key becomes a column name and each list becomes that column's values.

In [None]:
my_dict = {'col_a': [1, 2, 3], 'col_b': ['a', 'b', 'c']}
my_df = pd.DataFrame(my_dict)
my_df

You can also create a DataFrame from a list.

In [None]:
my_list = ['item 1', 'item 2', 'item 3']
my_df = pd.DataFrame(my_list)
my_df
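A DataFrame built from a plain list gets a single column labelled `0`. As a small sketch, the `columns` argument of `pd.DataFrame` can give that column a name; the name `'item'` here is just an illustrative choice.

```python
import pandas as pd

my_list = ['item 1', 'item 2', 'item 3']

# Without columns=, the single column is labelled 0.
unnamed_df = pd.DataFrame(my_list)

# columns= assigns a readable name to the column.
named_df = pd.DataFrame(my_list, columns=['item'])
print(named_df)
```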

### Slicing DataFrames
You can take subsets from a DataFrame. Let's revisit our imported file.

In [None]:
file_df.head()

We can create a new DataFrame by passing a list of the column names we need inside the indexing brackets, which gives the double square brackets below.

In [None]:
new_df = file_df[['lstat', 'crim', 'medv']]
new_df.head()

We can extract a single column as a Pandas Series. While a DataFrame is a two-dimensional object, a `Series` is a one-dimensional object.

In [None]:
medv = file_df['medv']
medv

We can check the type of the Series.

In [None]:
print(type(medv))

We can get a NumPy array out of a Series object using `.values`.

In [None]:
print(type(medv.values))

In [None]:
medv.values
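Recent versions of Pandas also provide `.to_numpy()`, which the Pandas documentation recommends over `.values`. The sketch below uses a made-up `prices` Series rather than the `medv` column so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical Series standing in for a column such as medv.
prices = pd.Series([24.0, 21.6, 34.7], name='price')

# to_numpy() returns the underlying data as a NumPy array.
as_array = prices.to_numpy()
print(type(as_array))
print(as_array)
```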

You can extract and view the index of the DataFrame.

In [None]:
my_idx = file_df.index
my_idx

You can sort the columns by name (`axis=1`) in either ascending or descending order. The call below sorts them in descending order.

In [None]:
file_df.sort_index(axis=1, ascending=False)

You can sort the rows by the contents of a column.

In [None]:
file_df.sort_values(by='medv')
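`.sort_values()` sorts in ascending order by default and returns a new DataFrame rather than changing the original. The sketch below, on a hypothetical `homes` DataFrame, shows a descending sort and a sort on two columns.

```python
import pandas as pd

# Hypothetical data used only for this example.
homes = pd.DataFrame({
    'rooms': [6, 7, 6, 8],
    'price': [24.0, 34.7, 21.6, 33.4],
})

# Highest price first.
by_price_desc = homes.sort_values(by='price', ascending=False)
print(by_price_desc)

# Sort by rooms, breaking ties by price.
by_rooms_then_price = homes.sort_values(by=['rooms', 'price'])
print(by_rooms_then_price)
```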

## Types

In Python, data types are determined implicitly from the value you assign.

In [None]:
a = 5.0
b = 2
c = 'Hello'

You can use `type()` to find the type of an object.

In [None]:
print(type(a))

In [None]:
print(type(b))

In [None]:
print(type(c))

## Arithmetic

You can carry out arithmetic operations on numeric objects.

In [None]:
print(a + b)

In [None]:
c = a + b
print(c ** 2)
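As a short sketch of how types behave in arithmetic, mixing a `float` and an `int` produces a `float`. The variable names `total`, `quotient`, `floor_quotient`, and `remainder` are introduced here only for illustration.

```python
a = 5.0
b = 2

# Adding a float and an int gives a float.
total = a + b
print(type(total))

# Other common operators: true division, floor division, and remainder.
quotient = a / b
floor_quotient = a // b
remainder = a % b
print(quotient, floor_quotient, remainder)
```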