<a href="https://colab.research.google.com/github/stefandanielachirei/licenta_2023_Ceica_Sergiu/blob/main/03_More_DataStructures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 3: More Data Structures

In the previous notebook, we looked at ways to store and work with a collection of data values within a sequence, like a list or an array. We also explored insights we can gain from computing summary statistics on those collections of data. Now we need a way to store and work with **multiple** separate collections of information about the same population - e.g. in addition to the test scores of the class, perhaps we also know the students' IDs, and results from one other exam they have taken in the past. In this notebook, we will work with data structures that allow us to store multiple pieces of information, or **features**, about a population; and we will explore a powerful library that allows us to read in, explore and manipulate datasets.

## Table of Contents
   
**1. [Collections of Multiple Features](#multFeat)**   
&ensp;&ensp;&ensp;&ensp;**1.1.** [Dictionaries](#dict)  
&ensp;&ensp;&ensp;&ensp;**1.2.** [Matrices](#matrix)  
**2. [Pandas Library](#pd)**  
&ensp;&ensp;&ensp;&ensp;**2.1.** [Reading in Data](#import)  
&ensp;&ensp;&ensp;&ensp;**2.2.** [Exploring the Pandas DataFrame](#df)  
&ensp;&ensp;&ensp;&ensp;**2.3.** [Selecting Rows & Columns](#loc)  
&ensp;&ensp;&ensp;&ensp;**2.4.** [Applying Functions](#apply)  
&ensp;&ensp;&ensp;&ensp;**2.5.** [Data Aggregation](#group)  


---

## <u>1. Collections of Multiple Features</u><a id='multFeat'></a>

First, we will explore 2 important data structures that allow us to organize multiple features of a given population: dictionaries and matrices.

### 1.1. Dictionaries<a id='dict'></a>

We can use arrays to collect data points on a single feature of the population, like `test_scores`. When we have multiple arrays capturing different features of the same population, one way to store all of the different arrays is in a **dictionary**.

Here are two examples of a dictionary:

In [None]:
# ex1. dictionary of single values
numerals = {'I': 1, 'V': 5, "X": 10}
numerals

In [None]:
# ex2. dictionary of collections
class_dict = {"student_ID":np.arange(1, len(test_scores)+1),
              "test_scores":test_scores}
class_dict

<i class="fa fa-book" style="font-size:20px;"></i> &nbsp;**Definition:**
<div class="alert alert-success">
A <b>dictionary</b> organizes data into <b>key-value pairs</b>. This allows us to store and retrieve values indexed not by consecutive integers, but by descriptive keys.

- <b>Keys:</b> Strings commonly serve as keys since they enable us to represent names of things. In the context of storing data, they are the column names, or names of the value(s) it represents.
- <b>Values:</b> The data that we are storing. This can be a single value, or a collection of values.
</div>

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **Accessing Dictionary Contents**  

1. Access a dictionary value by indexing the dictionary by the corresponding key:

In [None]:
# 1. get value associated with "test_scores"
class_dict['test_scores']

2. Dictionaries have methods that give us access to their keys, values, and key-value pairs:

In [None]:
## a. list of dictionary keys:
class_dict.keys()

In [None]:
## b. list of dictionary values:
class_dict.values()

In [None]:
## c. list of (key, value) pairs:
class_dict.items()

<div class="alert alert-warning">
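<i class="fa fa-info-circle" style="font-size:22px;color:orange"></i> &nbsp; Unlike lists and arrays, dictionaries are <b>not indexed by position</b>: you look values up by key, not by an integer index. Since Python 3.7, a dictionary preserves the order in which its key:value pairs were inserted, but you should not rely on that order to locate a value.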
</div>

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **Adding key-value Pairs**

You can add a new item (key-value pair) to an existing dictionary by assigning the value to a new key on the dictionary:

```python
dictionary['new_key'] = new_value
```

In [None]:
# adding a new entry
past_scores = np.array([89.0, 94.2, 78.0, 86.2, 81.2, 86.0, 88.3, 84.9, 88.1, 93.0, 82.2, 78.2, 96.1, 95.9, 98.2])

class_dict["past_test_score"] = past_scores
class_dict

In [None]:
class_dict.items()

<div class="alert alert-warning">
<b>Note</b>: There can only be 1 value per key. If you attempt to assign a new value to the dictionary but specify a key name that already exists in the dictionary, the existing values associated with that key will be overwritten.
</div>
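
To make the overwrite behavior concrete, here is a minimal sketch. The `inventory` dictionary is a hypothetical example and not part of this notebook's data:

```python
# Hypothetical example dictionary (not the class_dict used in this notebook)
inventory = {"apples": 3, "pears": 5}

# Assigning to an existing key replaces the old value; no new entry is created
inventory["apples"] = 10

# Assigning to a key that does not exist yet adds a new entry
inventory["plums"] = 7

print(inventory)
```

If you need to keep the old value, check for the key first with `"apples" in inventory` before assigning.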

---

### 1.2. Matrices<a id='matrix'></a>

Dictionaries organize information by features - all data values capturing student IDs are boxed into one container and saved into a dictionary; and data values about test scores from last year are boxed up into a separate container and saved into the dictionary under a different label.

But when you think about it, that is not the most helpful way to organize data when you are trying to **make predictions** about a specific case. For instance, say you were trying to guess what animal each record (row) is, given the following features:

||Opposable Thumbs|Class of Animal|Diet |Tail Length |Number of Legs | Flies|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|**0**|True|Mammal|Bananas| long | 2|False|
|**1**|False|Arthropod|Insects| none  |8| False|
|**2**|False|Bird|Fish|short|2|False|

We wouldn't want to look at the data one column at a time, the way dictionaries are organized, when we want to predict what animal record **0** might be. Instead, we'd want to look at all the features of the one record at the same time:

||Opposable Thumbs|Class of Animal|Diet |Tail Length |Number of Legs | Flies|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|**0**|True|Mammal|Bananas| long | 2|False|

To make a prediction about a particular record, we need to consider all its features - we want to organize information by record, not by column. This is exactly what a **matrix** is designed to do, and it is in this matrix form that we ultimately feed the data into machine learning models.

<i class="fa fa-book" style="font-size:20px;"></i> &nbsp;**Definition:**
<div class="alert alert-success">
    A <b>matrix</b> is a rectangular list*, or <b>a list of lists.</b> We say that matrix $M$ has shape $m \times n$:
    
* It has **m** rows: each row is a list of all features that describe a single record;
* It has **n** columns: each column is displayed as elements in the same position/index of every row, and represents a specific feature of the data

</div>

Our example above in matrix form would look like:

In [None]:
animal_matrix = [[ True,    'Mammal', 'Bananas',  'long', 2, False],
                 [False, 'Arthropod', 'Insects',  'none', 8, False],
                 [False,      'Bird',    'Fish', 'short', 2, False]]

animal_matrix

Compare this to how the data would be represented in a dictionary:

In [None]:
animal_dict = {"Opposable Thumbs": [True, False, False],
                "Class of Animal": ['Mammal', 'Arthropod', 'Bird'],
                           "Diet": ['Bananas', 'Insects', 'Fish'],
                    "Tail Length": ['long', 'none', 'short'],
                 "Number of Legs": [2, 8, 2],
                          "Flies": [False, False, False]}
animal_dict
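
The difference between the two layouts shows up in how much work it takes to pull out a record versus a feature. The sketch below uses small stand-ins for `animal_matrix` and `animal_dict` (two features, two records); the names `matrix` and `columns` are illustrative only:

```python
# Small stand-ins for animal_matrix and animal_dict (illustrative names)
matrix = [[True, "Mammal"],
          [False, "Bird"]]
columns = {"Opposable Thumbs": [True, False],
           "Class of Animal": ["Mammal", "Bird"]}

# Matrix: a single index returns a whole record
first_record = matrix[0]

# Matrix: a feature has to be collected from every row
animal_classes = [row[1] for row in matrix]

# Dictionary: a single key returns a whole feature
classes_from_dict = columns["Class of Animal"]

# Dictionary: a record has to be collected from every value
first_from_dict = [values[0] for values in columns.values()]

print(first_record, animal_classes)
print(classes_from_dict, first_from_dict)
```

Both structures hold the same information; they differ only in which access pattern is the cheap one.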

<div class="alert alert-warning">
<i class="fa fa-info-circle" style="font-size:22px;color:orange"></i> &nbsp;*We use lists in this example to demonstrate what a matrix looks like, since the features are represented by different value types (and values in NumPy arrays must all be of the same type). However, NumPy's representation of the matrix, the <a href='https://numpy.org/doc/stable/reference/arrays.ndarray.html'><code>ndarray</code></a> or <b>n-dimensional array</b>, is usually preferred over Python lists because NumPy arrays consume less memory and handle operations much more efficiently than lists. Even though we have a mix of data types in our example, that does not mean we are stuck using lists. <b>There are many ways to transform categorical features of datasets into numerical features</b>; figuring out how best to handle categorical variables (like "Diet" and "Class of Animal") is a big part of <b>data wrangling</b> for predictive modeling!
</div>
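
As one illustration of turning a categorical feature into numbers, the sketch below applies one-hot encoding with `pd.get_dummies`. This is only one of several possible approaches, and the `animals` frame is a hypothetical cut-down version of the animal example:

```python
import pandas as pd

# Hypothetical cut-down version of the animal data
animals = pd.DataFrame({"Diet": ["Bananas", "Insects", "Fish"],
                        "Number of Legs": [2, 8, 2]})

# One-hot encoding: each distinct Diet value becomes its own indicator column
encoded = pd.get_dummies(animals, columns=["Diet"])

print(encoded)
```

Every row has exactly one "on" indicator among the new `Diet_*` columns, so the table can then be converted to a purely numeric matrix.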

---

## 2. <u>Pandas Library</u><a id='pd'></a>

Now for the exciting part! Up to this point we have been fabricating data in the notebook to serve as our examples. With the introduction of the <b><a href='https://pandas.pydata.org/docs/user_guide/index.html'>Pandas Library</a></b>, we can **import** real data files into Jupyter Notebooks to explore. Let's do that now!

<br>

**Kaggle: our data source**  
We will use the <a href='https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data'>"Breast Cancer Wisconsin (Diagnostic) Dataset"</a> from **Kaggle**. Kaggle is a data science competition platform that hosts datathons, publishes datasets, and supports an online community of data scientists. Anyone can download the cleaned, published datasets from the site and access an abundance of resources - from **data dictionaries** that detail data contents, to notebooks and code that other users of the data have posted. It's a great place to find interesting problems to explore and to learn from others who have done or are doing the same.

<br>

**Pandas**  
Pandas is the standard tool for working with **dataframes**. A dataframe is a data structure that represents data in a 2-dimensional table of rows and columns. We've seen a couple of examples of dataframes already, in the section on standard deviations, and just now in the matrix section. They are very useful for exploratory data analysis, data cleaning, and processing before turning them into matrices to be fed into machine learning models.

<div class="alert alert-info"><span style='color:#4169E1'>We've already imported the <code>pandas</code> library, but <b>let's do that again here:</b></span></div>

In [None]:
import pandas as pd

---

### 2.1. Reading in the Data<a id='import'></a>

Pandas allows us to easily "read in" data from a downloaded csv (comma separated values) file and save it as a variable in the Jupyter Notebook, with the `pd.read_csv()` function. It can take many different arguments depending on the desired specifications, but we can just accept the default for the optional parameters. The only required parameter is `filepath_or_buffer`, which asks for the **file path**, or location, of the data file on your computer so that it can find it and turn it into a Pandas dataframe. There are 2 ways to specify the file path:


<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **Absolute File Path:**

All of your files on the computer have a file path. If you go to the location of any file on your File Explorer, you can find its absolute file path by clicking the address bar at the top of the window. You'll see something like:

<code>C:/Users/username/folder/data_folder/filename.csv</code>

When you have all of the information needed to locate the file, all the way to the very first layer of folders, you have an <u>absolute</u> file path.


<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **Relative File Path:**

Your Jupyter notebook (ipynb) file that you are working on has a path, too. When you navigate from one folder to the next on the File Explorer, you often start at a file location (let's call that location A), back out of that folder, enter into another folder, and access the file in this new location (location B). We can do something similar with file paths, by specifying the path of location B **relative to** the location of A.

Let's say this Jupyter notebook is found in location A, whose absolute path is:  
<code>C:/Users/username/folder/myNotebook.ipynb</code>

So, this is the **current directory**, or the location we are starting from:   
<code>C:/Users/username/folder/</code>

From here, we want to get to location B:  
<code>C:/Users/username/folder/data_folder/filename.csv</code>

<div class="alert alert-warning col-md-5 align=center"><b>To do this, we can specify the relative path:</b><br>  
<code>./data_folder/filename.csv</code></div>

<br><br><br><br>

**Notation:**  
- The **`.`** in the relative path indicates we are **staying in the same, current directory**. Since the `data_folder` that contains the desired file is **inside** the current directory we started out in, we indicate that it is from here that we then move into another folder, or identify a file to point to.
<br>

- The **`..`** indicates we need to **back out of the current folder**. It's the equivalent of clicking the back button on File Explorer.  
>**Example**:  <br>
>We can back out of multiple folders - say there is another file in this location we want to get to:
>`C:/Users/another_user/theirFile.csv`
>
>We can access this from location A with the relative path:  
>`../../another_user/theirFile.csv`


Once we have the path, all we need to do is put it in string form and pass it in as an argument!
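
To double-check where a relative path points, you can resolve it against the current directory. The sketch below does this for the two relative paths from the example above using Python's standard `posixpath` module. It only manipulates strings and does not check that the files exist:

```python
import posixpath

# Current directory of the notebook (location A), taken from the example above
current_dir = "C:/Users/username/folder"

# "." stays in the current directory
same_dir = posixpath.normpath(
    posixpath.join(current_dir, "./data_folder/filename.csv"))

# Each ".." backs out of one folder
backed_out = posixpath.normpath(
    posixpath.join(current_dir, "../../another_user/theirFile.csv"))

print(same_dir)
print(backed_out)
```

The first result matches the absolute path of location B, and the second lands in the other user's folder after backing out twice.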

<div class="alert alert-info"><span style='color:#4169E1'><b>Run the code below</b> to read in our first dataset!</span></div>


In [None]:
# using relative path!
df = pd.read_csv("./data/data.csv")
df

This dataset captures measurements and characteristics of breast mass (e.g. mass radius, smoothness, symmetry) and the actual diagnosis of the mass. The challenge here would be to predict the diagnosis from the features of the mass. The purpose of this section is to introduce data manipulation using Pandas dataframes and series, so we will not be tackling the challenge in this tutorial, but the notebooks uploaded on the <a href='https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data'>Kaggle page</a> would be a great place to see what other people have done with this dataset!

---

### 2.2. Exploring the Pandas DataFrame<a id='df'></a>

The Pandas DataFrame data structure allows us to easily access both the rows (records) **and** columns (features). It can be created in many different ways: from scratch, from a dictionary of values, from a matrix, from reading in a dataset, etc.


<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp;**From scratch**:  

Below is an empty DataFrame object - it has no column or row yet. Run the code to see what it looks like:

In [None]:
df_fromScratch = pd.DataFrame()
df_fromScratch

We can add a column to the dataframe in the same way that we can add new key-value pairs to dictionaries:

In [None]:
df_fromScratch['first_column'] = np.arange(1, 10)
df_fromScratch['second_column']= np.arange(9, 0, -1)
df_fromScratch

However, once one column of a certain **length** is added to a dataframe, all other new columns must be of the same length:

In [None]:
# this will throw an error
df_fromScratch['short_column'] = np.array([7, 8, 9])
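
If you want to see the error without stopping the notebook, you can catch it. The following is a minimal sketch on a fresh, hypothetical dataframe named `frame`; the exact wording of the error message may differ between Pandas versions:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame()
frame["first_column"] = np.arange(1, 10)  # 9 values

try:
    # only 3 values, but the dataframe already has 9 rows
    frame["short_column"] = np.array([7, 8, 9])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
print(list(frame.columns))
```

The failed assignment leaves the dataframe unchanged, so only `first_column` remains.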

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **From a dictionary:**  

We can also convert a dictionary of data into a Pandas DataFrame, as long as the number of elements captured in each dictionary value is the same:

In [None]:
# animal dictionary from earlier
pd.DataFrame(animal_dict)

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **From a matrix:**  

The same works for matrices. Since a matrix has no keys to serve as column names the way a dictionary does, we can include an argument to specify the column names:

In [None]:
columnNames = ['Opposable Thumbs', 'Class of Animal', 'Diet', 'Tail Length', 'Number of Legs', 'Flies']
pd.DataFrame(animal_matrix, columns = columnNames)

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **Exploring data contents:**  

Let's explore the Pandas capabilities using the breast cancer data we read in earlier. The first step we'd want to take when exploring a dataset is to understand what the dataset contains. The DataFrame object has many attributes and methods to help us with this task:

1. Identify the number of rows (records) and columns (features) in the data
2. Get info on the column names, their position on the dataframe, how many non-**null\*** values there are in each feature, and what the data type of each feature is
3. Get a list of the columns in the dataset
4. Create a table of summary statistics on all numeric features



In [None]:
# 1. find the number of rows and columns (row, col) in the dataset
df.shape

In [None]:
# 2. summary of features
df.info()

In [None]:
# 3. list of column names
df.columns

In [None]:
# 4. summary statistics of numeric features
df.describe()

<div class="alert alert-warning">
<i class="fa fa-info-circle" style="font-size:22px;color:orange"></i> &nbsp;A <b>null</b> or <b>nan</b> value represents an unknown or missing value in the data - it is an empty entry. If a feature is riddled with missing values, we may need to drop the feature from the investigation since it may not capture enough valid data, or the valid data it does capture may be biased. If a feature has some missing values but still captures valuable information, we will want to clean the data so as to replace these values with something more interpretable. We won't touch on this here, but you can find a <a href='https://pandas.pydata.org/docs/user_guide/missing_data.html'>guide on how to work with missing data using Pandas</a> in their documentation.
</div>
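
As a quick way to find out how many values are missing, the sketch below counts nulls per column on a hypothetical toy dataframe (`toy` is not the breast cancer data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry in each column
toy = pd.DataFrame({"radius": [1.2, np.nan, 3.4],
                    "label": ["B", "M", None]})

# isna() marks missing entries with True; summing counts them per column
missing_per_column = toy.isna().sum()

# dropna() keeps only the rows that have no missing values
complete_rows = toy.dropna()

print(missing_per_column)
print(complete_rows)
```

The same `isna().sum()` call works on any dataframe and complements the non-null counts reported by `.info()`.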



<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp;**Selecting a DataFrame feature - Series**:  <br>

We know from the `.info` output above that there is only one non-numeric field in the dataset, and that is the target variable - the diagnosis. Let's understand this target variable better.

We can select a feature from the DataFrame in a similar way to how we would get the value of a dictionary - by indexing the dataframe by the column name:

In [None]:
# get the data values of `diagnosis` column
df['diagnosis']

The extracted column is stored in a Pandas data structure called a **Pandas Series**.

<i class="fa fa-book" style="font-size:20px;"></i> &nbsp;**Definition:**
<div class="alert alert-success">
    A <b>Series</b> is a Pandas data structure that behaves very similarly to NumPy arrays and will be a valid argument to most NumPy functions. Series are also similar to dictionaries, in that its values can have index labels and be indexed by these labels.
</div>

For instance:

In [None]:
# this is an array
array_a = np.arange(1, 6)

# this is a Series, with non-numeric index labels, and a name
series_a = pd.Series(array_a, index=['a', 'b', 'c', 'd', 'e'], name="Series_A")

print("array: ", array_a)
print("\nSeries: ")
print(series_a)

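`series_a` has non-numeric index labels. To extract a value, we can use its position with `.iloc[]` (like an array) or its label (like a dictionary). Plain square brackets with an integer, as in `series_a[1]`, are deprecated for position-based access on a label-indexed Series in recent versions of Pandas:

In [None]:
# extracting value by position, like an array
series_a.iloc[1]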

In [None]:
# extracting value like a dictionary
series_a['b']

A Series can also have a `name` attribute, which is how Pandas knows what to name the column when a Series object is turned into a dataframe:

In [None]:
series_a.to_frame()

**Now back to our data.** If we were to predict the diagnosis based on the cancer mass attributes, it would be good to know how many categories of diagnoses there may be. We want to find the unique values of the variable.

Like the DataFrame object, the Pandas Series object also has many useful attributes and methods. Let's use a couple of them here to better understand the field:


In [None]:
# Find all unique values of the field
df['diagnosis'].unique()

There are only 2 possible values for the `diagnosis` variable - malignant (`M`) and benign (`B`). Use the `.value_counts()` method to count how many of each are in the dataset:

In [None]:
df['diagnosis'].value_counts()
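
Raw counts are easier to compare once they are expressed as proportions. The sketch below shows the `normalize` parameter of `.value_counts()` on a hypothetical Series of labels (not the real dataset):

```python
import pandas as pd

# Hypothetical labels, not the real dataset
labels = pd.Series(["B", "B", "M", "B"], name="diagnosis")

counts = labels.value_counts()                # absolute counts
shares = labels.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```

Knowing the class proportions up front helps you judge later whether a prediction is better than always guessing the majority class.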

### 2.3. Selecting Rows & Columns<a id="loc"></a>

What if we wanted to create subsets of our data, without pulling them out one by one into Pandas Series objects? We will often want to select a set of features (columns) to keep, or filter for data records (rows) that meet specific criteria. There are a number of ways to accomplish this:

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp;**Creating a dataset with fewer selected features:**  <br>

Often we want to investigate just a couple of fields from the data. In these cases, we may want to create a smaller dataset for greater efficiency and shorter run times. We can select fields to keep in a few ways:

**1. Double square brackets `[[]]`**  
We can index the dataframe with a list of column names to create a dataset with just those columns (but with all the rows).

In [None]:
df[['diagnosis', 'area_mean']]

**2. `.loc[]` attribute**   
We can do the same using the `.loc` attribute. This method allows us to specify the column names to keep **and** filter the rows at the same time.

Just as we could slice (extract specific ranges of) sequences based on their positional indices, we can slice the data rows and data columns by their index labels.

The `.loc[]` attribute takes two ranges. The range for rows is specified first, and the range for columns second. Unlike positional slicing, a label range includes its end label:

df.loc\[ <span style='color:green'>startRowLabel<b> : </b>endRowLabel</span>, <span style='color : navy'>startColName<b> : </b>endColName</span> \]

In [None]:
# grabs all records, and all columns positioned between and including `diagnosis` and `area_mean`
df.loc[:, "diagnosis":"area_mean"]

In [None]:
# if we just want the 2 columns and not the columns in between, we leverage the double-bracket
df.loc[:, ["diagnosis","area_mean"]]

**3. `.iloc[]` - attribute**  
This is very similar to `.loc[]`, but instead of using row and column labels, we specify index positions instead. We can see below that `diagnosis` is found at index `1`, and `area_mean` at index `5`. So, if we specify the range `1:6`, we should get the same table as before.

*Remember that, when slicing with indices, the `stop` value in the `start`:`stop` range is excluded from the selection.*

In [None]:
# run to see column positions
df.columns

In [None]:
# slicing using index positions
df.iloc[:, 1:6]

In [None]:
df.iloc[:, [1, 5]]
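
The difference between label-based and position-based slicing is easiest to see side by side. The sketch below uses a hypothetical four-row dataframe (`toy`) whose row labels happen to equal their positions:

```python
import pandas as pd

toy = pd.DataFrame({"id": [10, 11, 12, 13],
                    "diagnosis": ["B", "M", "B", "M"],
                    "area_mean": [1.0, 2.0, 3.0, 4.0]})

# .loc slices by label and INCLUDES the end label:
# rows 0, 1, 2 and columns "id" through "diagnosis"
by_label = toy.loc[0:2, "id":"diagnosis"]

# .iloc slices by position and EXCLUDES the stop position:
# rows 0, 1 and only the column at position 0
by_position = toy.iloc[0:2, 0:1]

print(by_label.shape)
print(by_position.shape)
```

The two calls use the same numbers for the row range but return different numbers of rows.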

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp;**Slicing and Filtering Dataset Records:**  <br>

Just as we can create data with subsets of columns, we can create data with subsets of rows.

**1. Regular indexing**  
When we specify a range of integer values, DataFrames know to slice the rows:

In [None]:
# keep first 100 records
df[:100]

**2. Filtering by criteria**  
We can also filter by a criterion in the data. For instance, what if we only wanted to check out the distributions of features for masses that are known to be "benign"? We would create what we call a **mask**, and apply it to the dataset, like this:


In [None]:
# applying a mask to the dataset
# to only keep records that are benign
df[df['diagnosis']=='B']

Recall that a Series acts very much like a NumPy array. This means that the expression `df['diagnosis']=='B'` creates a long sequence of `True` and `False` values, depending on whether each element in the `diagnosis` field is equal to `'B'` or not. This sequence of boolean values acts as a mask on the dataset - the DataFrame keeps only the records for which the mask is `True`.

In [None]:
# mask
df['diagnosis']=='B'
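
Masks can also be combined. The sketch below, on a hypothetical toy dataframe, uses `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons:

```python
import pandas as pd

toy = pd.DataFrame({"diagnosis": ["B", "M", "B", "M"],
                    "area_mean": [300.0, 900.0, 700.0, 400.0]})

# both conditions must hold
benign_and_large = toy[(toy["diagnosis"] == "B") & (toy["area_mean"] > 500)]

# at least one condition must hold
malignant_or_large = toy[(toy["diagnosis"] == "M") | (toy["area_mean"] > 500)]

# invert a mask
not_benign = toy[~(toy["diagnosis"] == "B")]

print(benign_and_large)
print(malignant_or_large)
print(not_benign)
```

Python's keywords `and`, `or`, and `not` do not work element-wise on a Series, which is why the symbol operators are used here.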

**3. `.loc[]` and `.iloc[]` attributes**

The `.loc[]` attribute supports slicing and filtering rows, as well:

In [None]:
# slicing rows by index label; unlike df[:100], the end label 100 is included (101 rows)
df.loc[:100]

The nice thing about `.loc[]` is that it allows you to filter or slice for rows and columns at the same time:

In [None]:
# filtering for benign records
df.loc[df['diagnosis']=='B', 'diagnosis':'area_mean']

We can also slice rows and columns simultaneously with the `.iloc[]` attribute:

In [None]:
df.iloc[:100, [1, 5]]

### 2.4. Feature Engineering & Applying Functions<a id="apply"></a>

Often, we will want to engineer new features from existing ones or transform features in the dataset. We'll approach this in a couple of ways:

1. Computing a new sequence of data with existing ones, and assigning it as a new column
2. Using the `.apply()` DataFrame method

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **1. Creating a new potential feature**

Looking at the available fields, it looks like there might be an opportunity to approximate how *irregularly shaped* a mass may have been. In particular, we are interested in these data fields:

In [None]:
# .copy() makes df_new an independent dataframe, so adding columns later does not affect df
df_new = df.loc[:, ['area_worst', 'radius_worst', 'perimeter_worst', 'symmetry_worst', 'diagnosis']].copy()
df_new

We know that the area of a circle is found by the equation $A = \pi r^2$, and that the circumference of a circle is given by $C = 2 \pi r$. Dividing the first equation by the second gives $A / C = r / 2$, so $C = 2A / r$. If we make the assumption that the mass is **not** irregularly shaped, i.e. the mass has a circular shape, then the measured perimeter of the mass and the calculated circumference should in theory be pretty similar. If the perimeter is much larger than the circumference, that may be a good indicator of irregularity in the shape, which may be a good predictor of a malignant mass.

<div class="alert alert-info"><span style='color:#4169E1'>Let's <b>calculate the circumference of the mass</b> given its measured area and radius, and create a new field that captures the <b>ratio of the calculated circumference to the measured perimeter:</b></span></div>



In [None]:
# Series behave like NumPy arrays - the same rules of arithmetic operations apply here

# C = 2*A/r : circumference = 2 x area / radius
circumference = 2*df_new['area_worst']/df_new['radius_worst']

# creating the new ratio field
df_new['ratio_CtoP'] = circumference / df_new['perimeter_worst']

df_new

Nice! We have engineered our first feature.
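
As a sanity check on the formula, the ratio should come out to exactly 1 for a perfect circle. The sketch below verifies this with an arbitrary, made-up radius:

```python
import math

radius = 3.0                          # arbitrary example value
area = math.pi * radius ** 2          # A = pi * r^2
perimeter = 2 * math.pi * radius      # true perimeter of a circle

circumference = 2 * area / radius     # C = 2A / r, as used for ratio_CtoP
ratio = circumference / perimeter

print(ratio)
```

Under the circular assumption, values of `ratio_CtoP` well below 1 would mean the measured perimeter is longer than a circle with that area and radius would have.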

<br>

<br><i class="fa fa-thumb-tack" style="font-size:16px;"></i>&nbsp; **2. `apply()`**  

`.apply` allows us to take a function and apply it to the Pandas series or dataframe.

<div class="alert alert-info"><span style='color:#4169E1'>Let's <b>standardize the columns <code>area_worst</code>, <code>radius_worst</code>, and <code>perimeter_worst</code></b> by applying the function we had defined earlier:</span></div>


In [None]:
# check the docs for more details!
df.apply?

In [None]:
# function to standardize data
def standard_units(numbers_array):
    "Convert an array of numbers to standard units"
    return (numbers_array - np.mean(numbers_array))/np.std(numbers_array)

In [None]:
# applying the function to the 3 fields
df.loc[:, ['area_worst', 'radius_worst', 'perimeter_worst']].apply(standard_units)

..so much more elegant than extracting each individual field as a Series, plugging them into the function, and setting each new output as a new column in the dataset!
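
What the function receives depends on how `.apply()` is called. The sketch below, on a hypothetical toy dataframe, contrasts the three common cases:

```python
import pandas as pd

toy = pd.DataFrame({"width": [1.0, 2.0, 3.0],
                    "height": [4.0, 5.0, 6.0]})

# DataFrame.apply with the default axis=0: the function receives each column as a Series
col_ranges = toy.apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1: the function receives each row as a Series
row_sums = toy.apply(lambda row: row["width"] + row["height"], axis=1)

# Series.apply: the function receives one element at a time
doubled = toy["width"].apply(lambda x: 2 * x)

print(col_ranges)
print(row_sums)
print(doubled)
```

The `standard_units` call above relies on the default column-wise behavior, which is why each field is standardized against its own mean and standard deviation.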

### 2.5. Data Aggregation<a id="group"></a>
The final concept we will cover is the concept of **grouped operations**. Grouping a dataset allows us to efficiently compute and compare aggregations of data values, computed separately for each group of records.

In the section above, we have just explored shape irregularity as a possible predictor of malignant vs. benign masses. One way to analyze whether we may be onto something is to compute the feature's summary statistic separately for the two groups, and see if we observe a notable difference. We can do this with the `.groupby()` method for Pandas DataFrames, which organizes the data into groups based on the values of the group-by variable, and computes an aggregation on the members of each group such that we are left with an aggregate value for each group. It takes the form:

`df.groupby(group_variable).aggregation()`


<div class="alert alert-info"><span style='color:#4169E1'>Let's <b>find the averages by diagnosis</b> of the new and existing features in the <code>df_new</code> data:</span></div>


In [None]:
df_new.groupby('diagnosis').mean()

Recall that there are only two possible diagnoses: **B**enign, or **M**alignant. We had seen earlier from looking at its `.value_counts()` that there are 357 benign records and 212 malignant records. The `groupby` operation above is calculating the averages for each of the 5 features in the `df_new` dataset, for both the group of 357 benign records and, separately, 212 malignant records. If we were to filter the dataset for benign records, isolate the "area_worst" field, and calculate its mean, we would arrive at the value in the upper left-most cell:

In [None]:
# verify that this matches the mean of "area_worst" for the benign diagnostic group:
print("Benign cases:", df_new[df_new['diagnosis']=="B"]['area_worst'].mean())

Aggregation allows us to quickly consolidate data by a specific category(ies). We can apply built-in aggregators (like `.mean()`) or user-defined aggregating functions, like below:

In [None]:
# mean of standard units should be very close to 0
def meanOfStandardUnits(numbers_array):
    return np.mean(standard_units(numbers_array))

In [None]:
df_new.groupby('diagnosis').aggregate(meanOfStandardUnits)
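
Several aggregations can also be computed in one call. The sketch below passes a list of built-in aggregator names to `.agg()` (an alias of `.aggregate()`), and separately a user-defined function, on a hypothetical toy dataframe:

```python
import pandas as pd

toy = pd.DataFrame({"diagnosis": ["B", "B", "M", "M"],
                    "area_worst": [400.0, 600.0, 1200.0, 1600.0]})

grouped = toy.groupby("diagnosis")["area_worst"]

# several built-in aggregations at once: one column per aggregation
summary = grouped.agg(["mean", "min", "max"])

# a user-defined aggregation: the spread (max - min) within each group
spread = grouped.agg(lambda values: values.max() - values.min())

print(summary)
print(spread)
```

Each group is reduced to one row in the result, indexed by the value of the group-by variable.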


**Congratulations!** You've made it through the datathon tutorial notebooks. While there is a lot more to learn beyond what's covered in this tutorial when it comes to the art and science of working with data, you have begun to build a solid foundation from which you can dive into the world of data science. As long as you remain curious and leverage the many resources available to you (documentation sites, Kaggle community, Stack Overflow, WiDS workshops, etc.), you are bound to rapidly develop your data science repertoire. Good luck, and have fun!

---

#### Content adapted from:  
- Jupyter Notebook modules from the [UC Berkeley Data Science Modules Program](https://ds-modules.github.io/DS-Modules/) licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
    - [Data 8X Public Materials for 2022](https://github.com/ds-modules/materials-x22/) by Sean Morris
- [Composing Programs](https://www.composingprograms.com/) by John DeNero based on the textbook [Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/9780262510875/structure-and-interpretation-of-computer-programs/) by Harold Abelson and Gerald Jay Sussman, licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)  
