## Pandas: Creating & Viewing Data

Pandas is a popular package in the Python machine learning ecosystem, used for data manipulation and analysis. It is tailored toward structured, tabular data, such as the data you would find in spreadsheets and relational databases.

One of Pandas' core data structures is the DataFrame, a two-dimensional table in which data is organized in rows and columns. Many common ML libraries work with Pandas DataFrames, allowing for a seamless transition from data preparation to model building.

Pandas is the primary package we will use in this program to operate on data. In this exercise, you will use Pandas to practice creating, accessing, and viewing data.

For more information about the Pandas package and Pandas DataFrames, consult the Pandas online [API Reference](https://pandas.pydata.org/docs/reference/index.html).

## Step 1

The code cell below imports the `pandas` package, using its conventional shorthand name, `pd`. It then imports the `numpy` package using its conventional shorthand name, `np`. Run the cell below.

In [1]:
import pandas as pd
import numpy as np

## Step 2

In the code cell below, use numpy's `np.random.normal()` function to create two arrays, each with size=100, loc=0.0, and scale=1.0. Name your arrays `a` and `b`.


If you would like to learn how to use the `np.random.normal()` function, use the help <b>?</b> functionality by running the line `np.random.normal?` in the code cell below. Recall that you can use help by specifying the function name (in this case, `np.random.normal`), followed by `?`. You can also access the online [documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html).
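As a rough illustration of what the `loc`, `scale`, and `size` parameters control, the sketch below draws a sample using different values than the exercise asks for. The seed, the parameter values, and the variable name `sample` are assumptions chosen for this example only.

```python
import numpy as np

# Seed the global generator so the draw is repeatable (the seed value is arbitrary).
np.random.seed(0)

# Draw 1,000 values from a normal distribution with mean 5.0 and standard deviation 2.0.
sample = np.random.normal(loc=5.0, scale=2.0, size=1000)

print(sample.shape)   # one-dimensional array with 1,000 elements
print(sample.mean())  # should land close to loc
print(sample.std())   # should land close to scale
```

With a sample this large, the mean and standard deviation should fall near the requested `loc` and `scale`, though never exactly on them.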



### Graded Cell

The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [2]:
a = np.random.normal(size=100, loc=0.0, scale=1.0)
b = np.random.normal(size=100, loc=0.0, scale=1.0)

print(a.shape)
print(b.shape)

(100,)
(100,)


### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [3]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testAB

try:
    p, err = testAB(a,b)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


## Step 3

Now we'll convert our data to a Pandas DataFrame. Calling `pd.DataFrame()` is the primary way to create a DataFrame. You will use this approach to create a DataFrame out of the two arrays created in the previous step.

Your DataFrame will contain two columns, named `x1` and `x2`.

1: First create a Python dictionary in which the keys are the names of the columns you want to create, and the values are the arrays. Column `x1` will contain array `a` and column `x2` will contain array `b`. Assign this dictionary to variable `d`.

2: Next use the `pd.DataFrame()` function to convert dictionary `d` to a DataFrame, and assign the result to variable `df`.

For more information on how to use the `DataFrame()` function, consult the online [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame).
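The dictionary-to-DataFrame pattern described in the two steps above can be sketched on unrelated data. The arrays `heights` and `weights`, the column labels, and the variable names below are hypothetical and are not part of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two equal-length arrays.
heights = np.array([1.62, 1.75, 1.80])
weights = np.array([55.0, 72.5, 80.1])

# Dictionary keys become column labels; dictionary values become the column data.
columns = {'height': heights, 'weight': weights}
people = pd.DataFrame(columns)

print(people.shape)          # (3, 2): three rows, two columns
print(list(people.columns))  # ['height', 'weight']
```

Because no row labels are supplied, Pandas assigns the default integer labels 0, 1, 2 to the rows.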

### Graded Cell

The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [9]:
# Arrays `a` and `b` were created in Step 2 and are reused here.
d = {
    'x1': a,
    'x2': b
}

df = pd.DataFrame(d)

# info() prints its summary directly and returns None, so it is not wrapped in print().
df.info()

print(df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      100 non-null    float64
 1   x2      100 non-null    float64
dtypes: float64(2)
memory usage: 1.7 KB
         x1        x2
0 -1.756490  0.199736
1 -0.008365  0.239409
2  0.664400 -0.667464
3 -0.697021  1.509248
4 -0.146670 -0.859329


### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [5]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testDf

try:
    p, err = testDf(d,df,a,b)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


### Step 4

The Pandas `head()` method allows you to view the first <i>n</i> rows in a DataFrame. The syntax is `DataFrame_obj.head(n)`; if you omit <i>n</i>, it defaults to 5. You can consult the [online documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) to see how to use this method. You can also use IPython's `help` functionality by typing `df.head?` in the cell below. Recall that you can obtain documentation by specifying the object name (in this case, `df`), followed by the method or property (in this case, `head`), followed by `?`.
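As a small sketch of how the row count passed to `head()` changes the result, consider the hypothetical ten-row DataFrame `numbers` below. The DataFrame and its column names are invented for this example.

```python
import pandas as pd

# Hypothetical DataFrame with ten rows.
numbers = pd.DataFrame({'n': range(10),
                        'n_squared': [i ** 2 for i in range(10)]})

first_five = numbers.head()    # with no argument, head() returns five rows
first_three = numbers.head(3)  # pass a number to change how many rows come back

print(first_five.shape)   # (5, 2)
print(first_three.shape)  # (3, 2)
```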


In the code cell below, use the `head()` method to view the first 5 rows of dataframe `df`.


In [10]:
# YOUR CODE HERE - this cell will not be graded
print(df.head())


         x1        x2
0 -1.756490  0.199736
1 -0.008365  0.239409
2  0.664400 -0.667464
3 -0.697021  1.509248
4 -0.146670 -0.859329


### DataFrame Indexing and Slicing

As you have seen, DataFrames are two-dimensional tables, labeled by row and column names, containing a set of column-oriented data such that each column has a particular datatype (dtype). Like Python lists, Pandas DataFrames count positions starting from 0. That is, the first row in a DataFrame is at position 0 and the first column is at position 0.

You may be familiar with indexing, which is the process of accessing elements of data in a data structure such as a list or a dictionary by specifying which element(s) one is interested in. Just as you can index into Python lists and dictionaries, you can also index into Pandas DataFrames.

Because DataFrames have row and column labels (names), Pandas provides a very useful ability to index, either based on those labels or on the positions of rows and columns. For example, Python lists and NumPy arrays are indexed by integer position (e.g., ```my_list[3]``` or ```my_array[5,7]```), while Python dictionaries are indexed by keys (e.g., ```my_dict['name']```). Because Pandas DataFrames contain data in a positional order, *and* have row and column labels, they can be indexed both by position and by label.

You can also "slice" a Pandas DataFrame. Slicing allows you to select a subset of rows or columns from the DataFrame. Just as with Python lists, slicing is supported using the colon operator ```:``` (you can also use the colon operator to slice into NumPy arrays).
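Before moving on to DataFrames, a short sketch of the list, array, and dictionary indexing styles mentioned above may be useful. The contents of `my_list`, `my_array`, and `my_dict` are made up for illustration.

```python
import numpy as np

my_list = [10, 20, 30, 40, 50]
my_array = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns, values 0 through 11
my_dict = {'name': 'Ada', 'role': 'engineer'}

print(my_list[3])       # 40: the fourth element, because positions start at 0
print(my_array[1, 2])   # 6: row position 1, column position 2
print(my_dict['name'])  # Ada: looked up by key rather than by position

print(my_list[1:4])     # [20, 30, 40]: the stop position is excluded
print(my_array[:, 0])   # [0 4 8]: every row, first column
```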


#### Mechanisms for Indexing and Slicing

Indexing and slicing DataFrames can be done through various mechanisms: 

* ```[]```
* ```loc[]```
* ```iloc[]```


The first of these, ```[]```, is the square-bracket based indexing that we are familiar with from Python lists and  dictionaries.  Because Pandas DataFrames are generally column-oriented, however, indexing with square brackets results in selecting a particular column based on its label. Each column in the DataFrame is of the Pandas Series datatype, which is a one-dimensional array, similar to a NumPy array. We can loosely think of a DataFrame as something like a dictionary of Series objects: each column label is like a key, and the value associated with that key is the data in that particular column, which is a Pandas Series.  For example, ```df['x2']``` will extract the column labeled 'x2' and will return a Series. 

You can slice a DataFrame using square-bracket based indexing. This will allow you to slice the rows in the DataFrame using an integer range. For example, `df[:3]` will return the first three rows in the DataFrame.
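The two behaviors of square brackets described above, selecting a column by label and slicing rows by integer range, can be sketched as follows. The DataFrame `scores` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with the default integer row labels 0 through 4.
scores = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0, 5.0],
                       'x2': [10.0, 20.0, 30.0, 40.0, 50.0]})

col = scores['x2']       # a single column label returns that column as a Series
first_rows = scores[:3]  # an integer slice returns rows instead of columns

print(type(col).__name__)      # Series
print(col.tolist())            # [10.0, 20.0, 30.0, 40.0, 50.0]
print(list(first_rows.index))  # [0, 1, 2]
```

Note the asymmetry: a label inside the brackets refers to a column, while a slice refers to rows.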

Indexing and slicing a DataFrame using ```loc``` is more general, in that it enables you to access both rows and columns based on their labels. It uses the same sort of multi-dimensional indexing that you may have seen with NumPy arrays (e.g. `[row, column]`), although with row and/or column labels rather than integer positions. The first entry refers to a row label, and the second entry refers to a column label. If only one entry is specified, it is used to identify a row in the DataFrame. The syntax for `loc` is `DataFrame_obj.loc[row_label]` or `DataFrame_obj.loc[row_label, column_label]`. Chaining two sets of brackets, as in `DataFrame_obj.loc[row_label][column_label]`, also works for reading a value, because the first lookup returns the row as a Series, which is then indexed by column label.

On the other hand, `iloc` allows you to access rows and columns by their integer positions (integer-based indexing). The syntax for `iloc` is `DataFrame_obj.iloc[row_position]` or `DataFrame_obj.iloc[row_position, column_position]`.
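The difference between `loc` and `iloc` is easiest to see when row labels are not integers. The sketch below assumes a hypothetical DataFrame `inventory` with string row labels; none of these names come from the exercise.

```python
import pandas as pd

# Hypothetical DataFrame whose row labels are strings, so labels and positions differ.
inventory = pd.DataFrame({'price': [1.5, 0.25, 3.0],
                          'quantity': [10, 200, 7]},
                         index=['apple', 'bean', 'cherry'])

by_label = inventory.loc['bean', 'quantity']  # row label, column label
by_position = inventory.iloc[1, 1]            # row position, column position
print(by_label, by_position)                  # both refer to the same cell

# loc slices include the end label; iloc slices exclude the end position.
print(inventory.loc['apple':'bean'].shape)    # (2, 2)
print(inventory.iloc[0:1].shape)              # (1, 2)
```

In the exercise's `df`, the row labels happen to equal the row positions, which is why `loc` and `iloc` can look interchangeable there.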


In the steps below, you will practice indexing and slicing a DataFrame using some of the above methods.  Online Pandas documentation describing indexing and slicing in great detail can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).




### Step 5

Add a new column to your `df` DataFrame. You can add this new column to `df` by:
1. using bracket notation and specifying the label of the new column: <code>df['new_column_label']</code>
2. assigning it the values of the column: 
<code>df['new_column_label'] = column values</code>


Name your new column <b>x3</b>.  Column <b>x3</b> will consist of the sum of the values in column <b>x1</b> and column <b>x2</b>. Hint: You can access each column using bracket notation as seen above, and you can add entire columns together using the addition operation (`+`).

After you have added the new column `x3` to DataFrame `df`, use the `head()` method to examine the new values in `df`.
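The same column-assignment pattern can be sketched with a different operation on unrelated data. The DataFrame `sales` and its column labels are hypothetical, and subtraction is used here only to show that any element-wise arithmetic works the same way.

```python
import pandas as pd

# Hypothetical DataFrame with two numeric columns.
sales = pd.DataFrame({'revenue': [100.0, 250.0, 80.0],
                      'cost': [60.0, 200.0, 90.0]})

# Arithmetic between columns is element-wise: row 0 with row 0, row 1 with row 1, and so on.
sales['profit'] = sales['revenue'] - sales['cost']

print(sales['profit'].tolist())  # [40.0, 50.0, -10.0]
print(list(sales.columns))       # ['revenue', 'cost', 'profit']
```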


### Graded Cell

The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [11]:
df['x3'] = df['x1'] + df['x2'] 

print(df.head())

         x1        x2        x3
0 -1.756490  0.199736 -1.556754
1 -0.008365  0.239409  0.231044
2  0.664400 -0.667464 -0.003064
3 -0.697021  1.509248  0.812227
4 -0.146670 -0.859329 -1.005999


### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [10]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testCol

try:
    p, err = testCol(df)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


### Step 6

Oftentimes you may want to access a value in a specific example (row) in your DataFrame. Run the code cell below and examine the values in DataFrame `df`. Note the value in column <b>x1</b> of row <b>5</b> (row with the index of 5).


In [12]:
df.head(6)

         x1        x2        x3
0 -1.756490  0.199736 -1.556754
1 -0.008365  0.239409  0.231044
2  0.664400 -0.667464 -0.003064
3 -0.697021  1.509248  0.812227
4 -0.146670 -0.859329 -1.005999
5 -1.377513 -0.382969 -1.760482


Pandas allows you to retrieve this value with either `loc` or `iloc`. Using bracket notation, you specify the row and column by label when using `loc`, or by integer position when using `iloc`.

Examine the code in the cells below and run the cells to see the results. Notice that they both return the expected value.

In [17]:
# loc accesses rows by their labels (names), yet the code below passes an integer.
# This works because in DataFrame df, each row's label is the same as its
# integer position, so loc is using the label 5 to access the row.

df.loc[5, 'x1']

np.float64(-1.3775129232979613)

In [18]:
df.iloc[5, 0]

np.float64(-1.3775129232979613)

### Step 7

You can also access a subset of rows and columns in a DataFrame using slicing. The code cell below retrieves rows 10 through 13; as with Python lists, the stop value of the slice (14) is not included.

In [19]:
df[10:14]

          x1        x2        x3
10 -0.131331 -0.594717 -0.726048
11 -1.036168 -1.211411 -2.247579
12 -0.848881 -0.135296 -0.984177
13  0.110807  0.839419  0.950226


You can also use slicing to specify every nth value. The code cell below outputs every 8th row in the DataFrame.

In [20]:
df[::8]

          x1        x2        x3
0  -1.756490  0.199736 -1.556754
8  -0.401996 -0.256660 -0.658656
16 -1.518802 -0.377150 -1.895952
24  0.515455  0.111137  0.626592
32  0.882042 -0.303527  0.578516
40  0.282551  1.060407  1.342958
48  1.681627 -0.463598  1.218029
56  0.065335  1.782088  1.847423
64  0.622761 -0.908308 -0.285547
72  0.063294 -0.559977 -0.496683


The cell below uses `iloc` to return the column with an index of 2 (column: `x3`) in rows 10-13.

In [21]:
df.iloc[10:14, [2]]

          x3
10 -0.726048
11 -2.247579
12 -0.984177
13  0.950226
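The slicing forms used in this step can be tried on a smaller table where the expected rows are easy to verify by eye. The DataFrame `grid` below is hypothetical. The last two lines also show a detail that is easy to miss: passing a list of column positions to `iloc` returns a DataFrame, while passing a single position returns a Series.

```python
import pandas as pd

# Hypothetical DataFrame with 20 rows and the default integer row labels.
grid = pd.DataFrame({'a': range(20),
                     'b': range(100, 120),
                     'c': range(200, 220)})

block = grid[10:14]               # rows at positions 10, 11, 12, 13
every_eighth = grid[::8]          # rows at positions 0, 8, 16
as_frame = grid.iloc[10:14, [2]]  # list of column positions: result is a DataFrame
as_series = grid.iloc[10:14, 2]   # single column position: result is a Series

print(list(block.index))                # [10, 11, 12, 13]
print(list(every_eighth.index))         # [0, 8, 16]
print(as_frame.shape, as_series.shape)  # (4, 1) (4,)
```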


### Step 8

Oftentimes you may want to select only the rows that meet certain conditions. This can be done with the syntax `DataFrame_obj[condition]`, where the <b>condition</b> represents some boolean condition on the data.

The code cell below uses this syntax to find the rows that have a value greater than 1 in column <b>x3</b>.
Run the cell below and inspect the results. You will see all of the rows in the DataFrame in which this condition is met.

In [22]:
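# A new variable name is used here so that the dictionary `d` from Step 3 is not overwritten.
rows_above_1 = df[df['x3'] > 1]
rows_above_1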

          x1        x2        x3
6   1.679657  1.653854  3.333511
7   0.221154  1.544366  1.765520
19  1.944215 -0.698394  1.245820
23  0.353902  0.719798  1.073701
26 -0.093573  2.436652  2.343079
30  1.290480  0.399251  1.689731
39 -0.839800  2.017380  1.177580
40  0.282551  1.060407  1.342958
46  0.881331  0.599168  1.480500
47  1.304638  0.111471  1.416109
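It may help to see what the condition inside the brackets evaluates to on its own. In the sketch below, which uses a hypothetical DataFrame `temps`, the comparison produces a boolean Series, and indexing with that Series keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical DataFrame.
temps = pd.DataFrame({'city': ['A', 'B', 'C', 'D'],
                      'high': [31.0, 18.5, 25.0, 35.2]})

mask = temps['high'] > 30  # boolean Series: True where the condition holds
hot = temps[mask]          # keeps only the rows where mask is True

print(mask.tolist())    # [True, False, False, True]
print(list(hot.index))  # [0, 3]: the selected rows keep their original labels
```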


You can use the DataFrame property `shape`, a tuple of the form (number of rows, number of columns), to see how many rows meet the specified condition. Run the cell below and inspect the results.

For more information about the `shape` property, consult the online [documentation](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.shape.html).

In [23]:
df[df['x3']>1].shape[0]

23

6     3.333511
7     1.765520
19    1.245820
23    1.073701
26    2.343079
30    1.689731
39    1.177580
40    1.342958
46    1.480500
47    1.416109
48    1.218029
56    1.847423
57    2.531374
58    1.364657
60    1.836325
66    1.729294
71    1.192818
76    2.525975
82    2.625385
83    2.963799
87    1.023528
96    1.674407
98    1.885276
Name: x3, dtype: float64
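The row count obtained with `shape[0]` in Step 8 can be reached in a few equivalent ways. The sketch below compares three of them on a hypothetical DataFrame `readings`; which one to use is largely a matter of preference.

```python
import pandas as pd

# Hypothetical DataFrame.
readings = pd.DataFrame({'value': [0.2, 1.4, 3.1, 0.9, 2.2]})

mask = readings['value'] > 1

count_from_shape = readings[mask].shape[0]  # shape is (rows, columns)
count_from_len = len(readings[mask])        # len() of a DataFrame is its row count
count_from_sum = int(mask.sum())            # each True counts as 1 when summed

print(count_from_shape, count_from_len, count_from_sum)  # 3 3 3
```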