# Python and Data Analysis 2

**Goal:** The goal of this project is to learn to prepare data for analysis using Pandas.

**Description:** Data often needs to be organized, combined, or cleaned before performing analysis. This project explains a few key ways to accomplish this.

## 2A: Preparing and Cleaning Data

Data *cleaning* refers to the process of ensuring our data is correct and complete. If data is not cleaned, our analysis could produce incorrect results based on corrupt/garbage information. *Preparation* refers to getting our data ready for analysis by putting it into a format we can easily understand.

### Removing Garbage Data

What counts as *garbage data* depends on the kind of data we are representing. One common example is a DataFrame whose columns contain inconsistent types.

In [1]:
import pandas as pd

df = pd.DataFrame({'x': [1,'more garbage',3,4,5,6,7,8,9,10],
                  'y': [1,4,9,16,25,36,49,64,'garbage',100]})
print(df)

              x        y
0             1        1
1  more garbage        4
2             3        9
3             4       16
4             5       25
5             6       36
6             7       49
7             8       64
8             9  garbage
9            10      100


Clearly, we want `x` and `y` to contain only numerical values, so rows 1 and 8 contain garbage data. We can fix this by trying to convert every value to a number.

In [2]:
df = df.apply(pd.to_numeric, errors='coerce') # Convert every value to a number; values that cannot be converted become NaN
print(df)

      x      y
0   1.0    1.0
1   NaN    4.0
2   3.0    9.0
3   4.0   16.0
4   5.0   25.0
5   6.0   36.0
6   7.0   49.0
7   8.0   64.0
8   9.0    NaN
9  10.0  100.0


Notice the strings have been replaced with `NaN`. This is a special value of type `float`, so it is consistent with the other numerical values in its column. However, any arithmetic involving `NaN` produces `NaN`. Therefore, we typically want to remove these values.

In [3]:
# NaN + 4 = NaN
print(df.iloc[1,0] + df.iloc[1,1]) # Add the x value from row 1 (NaN) to the y value from row 1 (4)

nan
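
Before deciding what to do with missing values, it can help to see where they are. The following is a minimal sketch on a small stand-in DataFrame (not the one above) that checks the type of `NaN`, counts missing entries per column, and selects the affected rows. It also shows that, unlike plain arithmetic, pandas aggregations such as `sum()` skip `NaN` by default.

```python
import numpy as np
import pandas as pd

# Small stand-in DataFrame with one NaN in each column
df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [1.0, 4.0, np.nan]})

# NaN is a float, so both columns keep a floating-point dtype
print(type(np.nan))  # <class 'float'>
print(df.dtypes)

# isna() marks each missing entry; sum() then counts them per column
nan_counts = df.isna().sum()
print(nan_counts)

# Select the rows that contain at least one NaN
bad_rows = df[df.isna().any(axis=1)]
print(bad_rows)

# Aggregations skip NaN by default (skipna=True), so this is 1.0 + 3.0
x_total = df['x'].sum()
print(x_total)  # 4.0
```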


We can handle `NaN` values in two ways. If we want to delete every row that contains a `NaN`, we can use `dropna()`. If we want to replace each `NaN` with a constant value, we can use `fillna(replacement-value)`. The first example below removes rows with `NaN`, and the second replaces the `NaN`s with the value 0.

In [4]:
df1 = df.dropna()
print("Dropping NaNs")
print(df1)

df2 = df.fillna(0)
print("\nFilling NaNs")
print(df2)

Dropping NaNs
      x      y
0   1.0    1.0
2   3.0    9.0
3   4.0   16.0
4   5.0   25.0
5   6.0   36.0
6   7.0   49.0
7   8.0   64.0
9  10.0  100.0

Filling NaNs
      x      y
0   1.0    1.0
1   0.0    4.0
2   3.0    9.0
3   4.0   16.0
4   5.0   25.0
5   6.0   36.0
6   7.0   49.0
7   8.0   64.0
8   9.0    0.0
9  10.0  100.0
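
Both methods accept further options. As a sketch on a small stand-in DataFrame, `dropna(subset=...)` drops a row only when specific columns are missing, and `fillna` can take a per-column replacement such as the column mean. Whether the mean is a sensible replacement depends on the data; it is used here only for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in DataFrame with one NaN in each column
df = pd.DataFrame({'x': [1.0, np.nan, 3.0, 4.0],
                   'y': [1.0, 4.0, np.nan, 16.0]})

# Drop a row only when 'y' is missing; the NaN in 'x' is kept
df_subset = df.dropna(subset=['y'])
print(df_subset)

# Fill each column's NaNs with that column's mean
# (df.mean() skips NaN when computing the mean)
df_mean = df.fillna(df.mean())
print(df_mean)
```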


### Preparing Data

#### Dropping Columns
When getting data ready for analysis, we might want to keep only some of the columns and discard the rest. For example, let's say we are interested in the `date`, `close`, and `volume` for Microsoft stock. Using the techniques discussed earlier, we can do the following.

In [5]:
df = pd.read_csv('MSFT.csv')
df = df[['date', 'close', 'volume']]
print(df.head())

         date   close      volume
0  1986-03-13  0.0972  1031788800
1  1986-03-14  0.1007   308160000
2  1986-03-17  0.1024   133171200
3  1986-03-18  0.0998    67766400
4  1986-03-19  0.0981    47894400
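
There are other ways to reach the same result. The sketch below uses an in-memory CSV as a stand-in for `MSFT.csv`; the `extra` column and its values are placeholders invented for this example. It compares reading only the needed columns with `usecols` against reading everything and dropping the unwanted column with `drop`.

```python
import io
import pandas as pd

# Stand-in for MSFT.csv; the 'extra' column and its values are placeholders
csv_text = """date,close,volume,extra
1986-03-13,0.0972,1031788800,a
1986-03-14,0.1007,308160000,b
"""

# Option 1: read only the columns we need
df_a = pd.read_csv(io.StringIO(csv_text), usecols=['date', 'close', 'volume'])

# Option 2: read everything, then drop the columns we do not need
df_b = pd.read_csv(io.StringIO(csv_text)).drop(columns=['extra'])

print(df_a)
print(df_a.equals(df_b))
```

Selecting with `usecols` avoids loading unneeded columns at all, which can matter for large files.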


#### Changing Data Types
After collecting the data we want, we sometimes need to change the type of a column. A common example is converting `date` from `string` to `datetime`.

In [6]:
df['date'] = pd.to_datetime(df['date']) # Convert 'date' column from string to datetime
print(df['date'].head()) # Display the first few rows of the 'date' column

0   1986-03-13
1   1986-03-14
2   1986-03-17
3   1986-03-18
4   1986-03-19
Name: date, dtype: datetime64[ns]
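
Once a column holds `datetime` values, its components are available through the `.dt` accessor. The sketch below uses a small stand-in Series that includes one deliberately invalid entry, to show that `errors='coerce'` works for dates much as it does for `pd.to_numeric`: unparseable values become `NaT` (the datetime counterpart of `NaN`).

```python
import pandas as pd

# Stand-in data; the last entry is deliberately not a valid date
dates = pd.Series(['1986-03-13', '1986-03-14', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
print(parsed)

# The .dt accessor exposes datetime components for the whole column
years = parsed.dt.year
weekdays = parsed.dt.day_name()
print(years)
print(weekdays)
```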


#### Renaming Columns
We can easily rename our columns with `rename`. We supply a dictionary that maps the old names to the new ones.

In [7]:
df = df.rename(columns={'date': 'Date', 'close': 'Closing Price', 'volume': 'Trading Volume'})
print(df.head())

        Date  Closing Price  Trading Volume
0 1986-03-13         0.0972      1031788800
1 1986-03-14         0.1007       308160000
2 1986-03-17         0.1024       133171200
3 1986-03-18         0.0998        67766400
4 1986-03-19         0.0981        47894400
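
One thing to watch for: by default, `rename` ignores dictionary keys that match no column, so a typo in an old name goes unnoticed. The sketch below, on a small stand-in DataFrame with a deliberately misspelled key, shows this behavior, how `errors='raise'` reports the mismatch, and how a function can be passed instead of a dictionary.

```python
import pandas as pd

# Small stand-in DataFrame
df = pd.DataFrame({'date': ['1986-03-13'], 'close': [0.0972]})

# 'clsoe' is a deliberate typo: it matches no column and is silently ignored
renamed = df.rename(columns={'close': 'Closing Price', 'clsoe': 'Typo'})
print(renamed.columns.tolist())

# errors='raise' turns the unmatched key into a KeyError
caught = False
try:
    df.rename(columns={'clsoe': 'Typo'}, errors='raise')
except KeyError as exc:
    caught = True
    print('KeyError:', exc)

# A function is applied to every column name
upper = df.rename(columns=str.upper)
print(upper.columns.tolist())
```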


#### Changing Index
Finally, we might want to relabel our index by setting its values to those of the `Date` column. This is useful if we frequently want to access rows by `Date`.

In [8]:
df = df.set_index('Date')
print(df.head())

            Closing Price  Trading Volume
Date                                     
1986-03-13         0.0972      1031788800
1986-03-14         0.1007       308160000
1986-03-17         0.1024       133171200
1986-03-18         0.0998        67766400
1986-03-19         0.0981        47894400


Note that the old `Date` column has become the new index and is no longer a column, so `df['Date']` now raises a `KeyError`. We instead access the index using `df.index`, `df.index.values`, or `df.index.values.tolist()`.
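
With dates as the index, rows can be looked up by date using `.loc`. The sketch below rebuilds the first three rows shown above as a stand-in for the full dataset, selects a single value and a date range, and uses `reset_index()` to turn the index back into an ordinary column.

```python
import pandas as pd

# First three rows from the output above, used as a stand-in for the full data
df = pd.DataFrame({
    'Date': pd.to_datetime(['1986-03-13', '1986-03-14', '1986-03-17']),
    'Closing Price': [0.0972, 0.1007, 0.1024],
}).set_index('Date')

# Look up a single value by its date label
price = df.loc[pd.Timestamp('1986-03-14'), 'Closing Price']
print(price)

# Slice a date range; label-based slices include both endpoints
window = df.loc['1986-03-14':'1986-03-17']
print(window)

# reset_index() moves 'Date' back into a regular column
restored = df.reset_index()
print(restored.columns.tolist())
```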