Before getting started, you will need to have [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html) and [scikit-learn](https://scikit-learn.org/stable/install.html) installed on your computer.

Run the following command in the Terminal (macOS) or the Command Prompt (Windows).

`pip install pandas scikit-learn`

_Be sure to run all of the code blocks as you read through the notebook._

# Intro to Pandas

`pandas` is designed to make it easier to work with structured data. 

Most analyses you perform will likely involve tabular data, e.g., from .csv files or relational databases (queried with SQL).

If you see a csv file you should be happy!

The `DataFrame` object in `pandas` is "a two-dimensional tabular, column-oriented data structure with both row and column labels."

If you're curious:

>The `pandas` name itself is derived from *panel data*, an econometrics term for multidimensional structured data sets (data where observations are both per time and per individual), and *Python data analysis* itself. After getting introduced, you can consult the full [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/).

In [2]:
import numpy as np
import pandas as pd

Two important data types defined by pandas are `Series` and `DataFrame`.

## Series

You can think of a `Series` as a "column" of data, such as a collection of observations on a single variable.

In [3]:
s = pd.Series(np.random.randn(4), name='test')
s

0    0.969608
1   -1.442892
2    0.566003
3   -0.593670
Name: test, dtype: float64

Here you can imagine the indices `0, 1, 2, 3` as indexing four listed
companies, and the values being daily returns on their shares.

Pandas `Series` are built on top of NumPy arrays and support many similar
operations:

In [4]:
s * 100

0     96.960797
1   -144.289207
2     56.600281
3    -59.367001
Name: test, dtype: float64

In [5]:
np.abs(s)

0    0.969608
1    1.442892
2    0.566003
3    0.593670
Name: test, dtype: float64

But `Series` provide more than NumPy arrays.

For one, they have some additional (statistically oriented) methods:

In [6]:
s.describe()

count    4.000000
mean    -0.125238
std      1.100325
min     -1.442892
25%     -0.805976
50%     -0.013834
75%      0.666904
max      0.969608
Name: test, dtype: float64

Their indices are also more flexible:

In [7]:
s.index = ['SpaceX', 'BTC', 'GME', 'Amazon']
s

SpaceX    0.969608
BTC      -1.442892
GME       0.566003
Amazon   -0.593670
Name: test, dtype: float64
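Once a `Series` has descriptive labels, values can be read and written through those labels, much like dictionary keys. The following is a minimal sketch of that idea; the values and the name `returns` are made up for illustration and are not taken from the cells above.

```python
import pandas as pd

# Hypothetical values chosen for illustration; they are not from the notebook.
returns = pd.Series(
    [0.5, -1.2, 0.3, -0.7],
    index=['SpaceX', 'BTC', 'GME', 'Amazon'],
    name='daily_returns',
)

# Look up a single value by its label, much like a dictionary key.
gme_return = returns['GME']

# Assign a new value through the label.
returns['BTC'] = 0.0

# Membership tests check the index labels, not the values.
has_amazon = 'Amazon' in returns

print(gme_return, has_amazon)
print(returns)
```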

## DataFrames

A `DataFrame` is an object for storing related columns of data. While a `Series` is a single column of data, a `DataFrame` is several columns, one for each variable. In essence, a `DataFrame` in pandas is analogous to a (highly-optimized) Excel spreadsheet.

It is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.
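To make the "several columns" picture concrete, here is a small sketch that builds a `DataFrame` from a dictionary of columns. The column names, row labels, and values are invented for illustration only.

```python
import pandas as pd

# Made-up values for illustration only.
columns = {
    'rooms': [6, 7, 5],
    'age': [65.2, 78.9, 61.1],
}
small_df = pd.DataFrame(columns, index=['house_a', 'house_b', 'house_c'])

# Each column of a DataFrame is itself a Series.
rooms = small_df['rooms']

print(type(rooms))
print(small_df.shape)  # (number of rows, number of columns)
```

Each dictionary key becomes a column label, and the optional `index` argument supplies the row labels.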


In [8]:
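# Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on an older scikit-learn release.
from sklearn.datasets import load_boston

boston = load_boston()
# Build the DataFrame and label its columns with the feature names.
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df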


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


In [9]:
print(type(df))
df.head(3)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03


We can select particular rows using standard Python slice notation:

```
a[start:stop]  # items start through stop-1
a[start:]      # items start through the rest of the array
a[:stop]       # items from the beginning through stop-1
a[:]           # a copy of the whole array
```
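The same notation can be tried on a plain Python list before applying it to a `DataFrame`. This is a small sketch with an arbitrary list `a`:

```python
a = [10, 20, 30, 40, 50]

middle = a[1:3]    # items at positions 1 and 2
tail = a[3:]       # items from position 3 to the end
head = a[:2]       # items at positions 0 and 1
copy_of_a = a[:]   # a new list with the same items

print(middle, tail, head, copy_of_a)
```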

In [10]:
# Practice this by selecting a range of rows from `df`

df[4:6]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3.0,222.0,18.7,394.12,5.21


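Note that this kind of slicing **depends on what you slice with**: integer slices such as `df[4:6]` select rows by position, while slices of labels (for example, strings) select by index label and include the end point, so if the index of the dataframe is made of strings (or anything other than ordered integers) **you may get surprises**.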
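A short sketch of the kind of surprise meant here, using a hypothetical frame with a string index: an integer slice counts positions and excludes the stop, whereas a slice of labels includes the stop label.

```python
import pandas as pd

# Hypothetical frame with string row labels, for illustration only.
frame = pd.DataFrame({'value': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# Integer slice: positions 1 and 2, stop position excluded.
by_position = frame[1:3]

# Label slice: from 'b' through 'd', stop label included.
by_label = frame['b':'d']

print(by_position)
print(by_label)
```

When the intent matters, `.iloc` (positions) and `.loc` (labels) make the choice explicit.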

To select columns, we can pass a list containing the names of the desired columns as strings:

In [11]:
df[['CRIM', 'ZN']]

Unnamed: 0,CRIM,ZN
0,0.00632,18.0
1,0.02731,0.0
2,0.02729,0.0
3,0.03237,0.0
4,0.06905,0.0
...,...,...
501,0.06263,0.0
502,0.04527,0.0
503,0.06076,0.0
504,0.10959,0.0
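One detail worth trying out is the difference between passing a single name and passing a list of names. The sketch below uses a tiny stand-in frame with made-up values (only the column names echo the ones above).

```python
import pandas as pd

# Tiny stand-in frame with made-up values.
frame = pd.DataFrame({'CRIM': [0.1, 0.2], 'ZN': [18.0, 0.0], 'INDUS': [2.3, 7.1]})

one_column = frame['CRIM']           # a single name returns a Series
as_frame = frame[['CRIM']]           # a one-item list returns a DataFrame
reordered = frame[['ZN', 'CRIM']]    # columns come back in the order listed

print(type(one_column))
print(type(as_frame))
print(reordered)
```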


To select both rows and columns by integer position, use the `iloc` attribute with the format `.iloc[rows, columns]`:

In [12]:
df.iloc[2:5, 0:4]

Unnamed: 0,CRIM,ZN,INDUS,CHAS
2,0.02729,0.0,7.07,0.0
3,0.03237,0.0,2.18,0.0
4,0.06905,0.0,2.18,0.0
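For comparison, `.loc` performs the same kind of selection by label rather than by position. The sketch below uses a small frame with made-up values and a default integer index; note that the `.iloc` slice excludes its end position while the `.loc` slice includes its end label.

```python
import pandas as pd

# Small frame with made-up values and the default 0..3 integer index.
frame = pd.DataFrame({
    'CRIM': [0.1, 0.2, 0.3, 0.4],
    'ZN': [18.0, 0.0, 0.0, 12.5],
    'INDUS': [2.3, 7.1, 7.1, 2.2],
})

# Positions: rows 1 and 2, columns 0 and 1 (end positions excluded).
by_position = frame.iloc[1:3, 0:2]

# Labels: rows labelled 1 through 2 (end label included), columns by name.
by_label = frame.loc[1:2, ['CRIM', 'ZN']]

print(by_position)
print(by_position.equals(by_label))
```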
