# Lab 2 (Due by 11:59 pm via Gradescope)

Your Name:

Due: Tuesday, Sept. 30 @ 11:59 pm

### Submission Instructions
Submit this `ipynb` file to Gradescope (this can also be done via the assignment on Canvas). To ensure that your submitted `ipynb` file represents your latest code, run a fresh `Kernel > Restart & Run All` just before uploading the `ipynb` file to Gradescope. **In addition:**
- Make sure your name is entered above
- Make sure you comment your code effectively
- If problems are difficult for the TAs/Profs to grade, you will lose points

### Tips for success
- Collaborate: bounce ideas off of each other. If you are having trouble, you can ask your classmates or Dr. Singhal for help with specific issues; however...
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), i.e. you are welcome to **talk about** (*not* show each other your answers to) the problems.

# you might use the below modules on this lab
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## Part 1: Understanding Cleaning
### Part 1.1: Grabbing Data and Preliminary Cleaning (10 points)

We wish to create a data frame that includes all the spells for each class (a "class" is something like a "wizard", or a "bard") in Dungeons and Dragons 5th Edition, which you can find [here](http://dnd5e.wikidot.com/). Your final data frame should look something like:

| Class     | Level     | Spell Name    | School      | Casting Time | Range                | Duration      | Components |
|----------:|----------:|--------------:|------------:|-------------:|---------------------:|--------------:|-----------:|
| Artificer | Level 0   | Acid Splash   | Conjuration | 1 Action     | 60 Feet              | Instantaneous | V, S       |
| Artificer | Level 0   | Booming Blade | Evocation   | 1 Action     | Self (5-foot radius) | 1 Round       | S, M       |
| ...       | ...       | ...           | ...         | ...          | ...                  | ...           | ...        |
| Wizard    | Level 9   | Wish          | Conjuration | 1 Action     | Self                 | Instantaneous | V          |

Below are two functions which, respectively:
- take a class (string) as an argument and return the tables from the class's D&D wiki spell page in a dictionary, with one table for each spell level
- take a list of classes, apply the first function to each of them, then combine all the tables into a data frame, including a column with the class name and a column with the spell level

**DO NOT CHANGE ANYTHING IN THE BODY OF THE FUNCTIONS.**

**In a markdown cell** create a bullet point list where you explain what each chunk of code does. Your bullet point list should have **FIVE** bullet points/explanations corresponding to the five chunks below the `# EXPLAIN THIS (number)` comments. You must accurately summarize the content and procedure of each code chunk.


def get_class_spell_dict(dnd_class):
    """ takes a D&D class (string) and gets the spell tables and saves them in a dictionary
    
    Args:
        dnd_class (str): the D&D class
        
    Returns:
        table_dict (dict): a dictionary of tables, one for each spell level
    """

    # EXPLAIN THIS (1)
    url = f'http://dnd5e.wikidot.com/spells:{dnd_class}'
    tables = pd.read_html(url)
    table_dict = {}
    for i in range(len(tables)):
        table_dict[f'Level {i}'] = tables[i]

    return table_dict

def get_full_spell_df(class_list):
    """ takes a list of D&D classes (list of strings), applies the get_class_spell_dict() function to them, and then combines them into a data frame

    Args:
        class_list (list): a list of strings

    Returns:
        spells_df (data frame): a data frame with all the spells
    """

    spells_df = pd.DataFrame()
    level_list = []
    long_class_list = []
    
    # EXPLAIN THIS (2)
    for class_ in class_list:
        class_dict = get_class_spell_dict(class_)
        class_df = pd.DataFrame()

        # EXPLAIN THIS (3)
        for level in class_dict:
            level_list.append([level] * len(class_dict[level]))
            class_df = pd.concat([class_df, class_dict[level]])

        # EXPLAIN THIS (4)
        long_class_list.append([class_] * len(class_df))
        spells_df = pd.concat([spells_df, class_df])

    # EXPLAIN THIS (5)
    spells_df.insert(0, 'Level', [item for sublist in level_list for item in sublist])
    spells_df.insert(0, 'Class', [item for sublist in long_class_list for item in sublist])
    
    return spells_df

class_list = ['Artificer', 'Bard', 'Cleric', 'Druid', 'Paladin', 'Ranger', 'Sorcerer', 'Warlock', 'Wizard']
notclean_df = get_full_spell_df(class_list)
notclean_df

|     | Class     | Level   | Spell Name     | School        | Casting Time | Range                | Duration                      | Components |
|----:|----------:|--------:|---------------:|--------------:|-------------:|---------------------:|------------------------------:|-----------:|
| 0   | Artificer | Level 0 | Acid Splash    | Conjuration   | 1 Action     | 60 Feet              | Instantaneous                 | V, S       |
| 1   | Artificer | Level 0 | Booming Blade  | Evocation     | 1 Action     | Self (5-foot radius) | 1 round                       | S, M       |
| 2   | Artificer | Level 0 | Create Bonfire | Conjuration   | 1 Action     | 60 Feet              | Concentration, up to 1 minute | V, S       |
| 3   | Artificer | Level 0 | Dancing Lights | Evocation     | 1 Action     | 120 feet             | Concentration up to 1 minute  | V, S, M    |
| 4   | Artificer | Level 0 | Fire Bolt      | Evocation     | 1 Action     | 120 feet             | Instantaneous                 | V, S       |
| ... | ...       | ...     | ...            | ...           | ...          | ...                  | ...                           | ...        |
| 13  | Wizard    | Level 9 | Time Ravage    | Necromancy DC | 1 Action     | 90 feet              | Instantaneous                 | V, S, M    |
| 14  | Wizard    | Level 9 | Time Stop      | Transmutation | 1 Action     | Self                 | Instantaneous                 | V          |
| 15  | Wizard    | Level 9 | True Polymorph | Transmutation | 1 Action     | 30 feet              | Concentration, up to 1 hour   | V, S, M    |
| 16  | Wizard    | Level 9 | Weird          | Illusion      | 1 Action     | 120 feet             | Concentration, up to 1 minute | V, S       |


Your answers here:

-
-
-
-
-

### Part 1.2: More Cleaning (15 points)

The "final" data frame from the previous part is still not as clean as it could be. In a markdown cell, perform these two tasks:

1. Write a short paragraph (at least four sentences) discussing what else you would do to continue cleaning up the data
2. Think about the `Components` column specifically and write out some pseudo-code (you can see how I wrote the example below by double-clicking on this cell) that roughly outlines how you would go about cleaning that column

```
def my_cleaning_func(column):
    """ this function cleans a column from a data frame

    Args: column (Series)

    Returns: clean_column (Series)
    """

    # take the column
    # clean the column (I have written comments for these steps, YOU SHOULD WRITE PSEUDO-CODE)
    # save it as clean_column

    return clean_column
```
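
Before writing your pseudo-code, it can help to look at the distinct values a column actually contains. The following is a minimal sketch on made-up strings: the `toy_range` Series is hypothetical and only mimics the mixed capitalization seen in the `Range` column of the output above. It is meant as an inspection habit, not as a cleaning solution for `Components`.

```python
import pandas as pd

# Hypothetical values that mimic inconsistent capitalization and stray whitespace
toy_range = pd.Series(["60 Feet", "120 feet", "Self", "120 feet", "60 feet ", "Self"])

# Raw counts treat differently cased or padded strings as distinct values
print(toy_range.value_counts())

# Stripping whitespace and lowercasing reveals how many values are truly distinct
normalized = toy_range.str.strip().str.lower()
print(normalized.value_counts())
```

Comparing the two printouts is a quick way to decide which inconsistencies your cleaning steps need to handle.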

Your answers here:

-
-

```

```

# Part 2: Summarizing and Visualizing Data

This problem uses `evdataset.csv`, available in the Labs Module on Canvas. The data set was taken and adapted from Kaggle (no longer hosted) and contains a sample of 194 electric vehicles on the market until 2022. The full dataset includes basic technical specifications, battery capacity, and range in various weather and road conditions.

df_ev = pd.read_csv('evdataset.csv', index_col='id')
df_ev.head()

| id   | drive | acceleration | topspeed | electricrange | totalpower | totaltorque | batterycapacity | chargespeed | length | width | height | wheelbase | grossweight |
|-----:|------:|-------------:|---------:|--------------:|-----------:|------------:|----------------:|------------:|-------:|------:|-------:|----------:|------------:|
| 1647 | Rear  | 7.8          | 185      | 390           | 168        | 350         | 77.4            | 49          | 4515   | 1890  | 1580   | 2900      | 2495        |
| 1493 | AWD   | 6.2          | 160      | 330           | 215        | 520         | 69.7            | 46          | 4684   | 1834  | 1701   | 2829      | 2580        |
| 1229 | AWD   | 3.2          | 260      | 415           | 500        | 850         | 93.4            | 46          | 4963   | 1966  | 1381   | 2900      | 2880        |
| 1252 | Rear  | 5.7          | 190      | 470           | 250        | 430         | 83.9            | 54          | 4783   | 1852  | 1448   | 2856      | 2605        |
| 1534 | Rear  | 7.9          | 160      | 450           | 150        | 310         | 82.0            | 55          | 4261   | 1809  | 1568   | 2771      | 2300        |


## Part 2.1: Numeric Summaries (25 points)

On your own or with a classmate, discuss which features you think would be most interesting to compare across different drives. Pick two or three of them and, after using `.groupby()` to group by the `drive` feature, calculate for all of them:

- means
- medians
- standard deviations

Then, using the original data set, look at the pairwise correlations (with the correlation matrix, check out the [`DataFrame.corr()` documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)). Finally, **in a markdown cell** discuss your key takeaways from the numeric summaries you calculated, and what the correlations were between your chosen features. Were they among the strongest/weakest relationships? Do you think the type of drive may impact these relationships? Any other interesting results of note?
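
As a rough illustration of the mechanics only, the sketch below runs grouped summaries and a correlation matrix on a small made-up table. The `toy` frame, its values, and the choice of `acceleration` and `electricrange` are assumptions for demonstration; swap in `df_ev` and your own features.

```python
import pandas as pd

# Toy stand-in for df_ev: these values are made up for illustration only
toy = pd.DataFrame({
    "drive": ["AWD", "AWD", "Front", "Front", "Rear", "Rear"],
    "acceleration": [3.0, 5.0, 8.0, 10.0, 6.0, 8.0],
    "electricrange": [400, 500, 250, 350, 380, 420],
})

features = ["acceleration", "electricrange"]

# One row per drive type; columns are (feature, statistic) pairs
summary = toy.groupby("drive")[features].agg(["mean", "median", "std"])
print(summary)

# Pairwise Pearson correlations computed on the ungrouped data
corr = toy[features].corr()
print(corr)
```

Passing a list of statistic names to `.agg()` keeps all three summaries in one table, which makes them easier to compare across drive types than three separate calls.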


## Part 2.2: Visual Summaries (25 points)

Again choose two or three features (they can be the same as or different from those in the previous part) and make a few plots to further your understanding of the data. For the first two plots, you may use any of `matplotlib`, `seaborn` or `plotly` (you may find some easier to use than others). Please make:

- Histograms for each drive type (i.e. three histograms, one for each of: AWD, Front, Rear) for one of your chosen features. You may make them separately or within a subplot.
- A scatterplot of two of your features, with points colored by drive type.
- Check out the [seaborn plot options again](https://seaborn.pydata.org/examples/index.html) and pick one to use with your chosen features (exercise some thought as to what you are hoping the plot will communicate; you may find it worthwhile to discuss options with your classmates).

Then, **in a markdown cell** discuss what you learned from the plots you created. If you used the same features that you investigated numerically, did the plots corroborate your findings? Or did they provide new insight? If you used new features, what do the plots tell you about what the numeric relationship(s) between the features might be? Any other interesting results to note?
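
One possible way to lay out the first two plots is sketched below using `matplotlib` only. The `toy` frame and its values are made up, and the feature choices are assumptions; with the real data you would replace `toy` with `df_ev`. In a notebook the figures render when the cell finishes, so no explicit `plt.show()` is included.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for df_ev: these values are made up for illustration only
toy = pd.DataFrame({
    "drive": ["AWD"] * 3 + ["Front"] * 3 + ["Rear"] * 3,
    "acceleration": [3.0, 4.5, 5.0, 8.0, 9.0, 10.0, 6.0, 7.0, 8.0],
    "electricrange": [400, 450, 500, 250, 300, 350, 380, 400, 420],
})

# One histogram per drive type, sharing axes so the panels are comparable
drives = sorted(toy["drive"].unique())
fig, axes = plt.subplots(1, len(drives), figsize=(12, 3), sharex=True, sharey=True)
for ax, drive in zip(axes, drives):
    subset = toy.loc[toy["drive"] == drive, "electricrange"]
    ax.hist(subset, bins=5)
    ax.set_title(f"{drive} (n={len(subset)})")
    ax.set_xlabel("electricrange")
axes[0].set_ylabel("count")

# Scatterplot of two features, one color (and legend entry) per drive type
fig2, ax2 = plt.subplots()
for drive, group in toy.groupby("drive"):
    ax2.scatter(group["acceleration"], group["electricrange"], label=drive)
ax2.set_xlabel("acceleration")
ax2.set_ylabel("electricrange")
ax2.legend(title="drive")
```

Sharing the x- and y-axes across the histogram panels is a design choice: it makes differences in location and spread between drive types visible at a glance, at the cost of some empty space in panels with narrow distributions.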

## Part 2.3: Future Considerations (25 points)

1) Explicitly calculate the variance of all the numeric features in the raw `df_ev` data set, as well as the covariance matrix. 

Then, in a few sentences (**in a markdown cell**) discuss in detail:

(a) why some variances are larger than others, 

(b) why the covariances between the different features are not as useful as the correlations you calculated in Part 2.1 (**pick a couple** of example relationships to illustrate the point(s) you make),

and (c) whether the relationships we see between the features based on the correlation matrix from Part 2.1 are necessarily the true relationships between those features. Think about the meme that was shown in class:

![d](https://miro.medium.com/v2/resize:fit:547/1*2BnD3YAUBGNutkKiG5dKfg.jpeg)
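
To see how `.var()`, `.cov()` and `.corr()` behave when a feature is rescaled, here is a minimal sketch on made-up numbers. The column names `length_mm`, `length_m` and `weight_kg` (and the units they imply) are hypothetical and are not taken from `evdataset.csv`; run the same calls on `df_ev` for the actual assignment.

```python
import pandas as pd

# Made-up measurements; column names and units are assumptions for illustration
toy = pd.DataFrame({
    "length_mm": [4200.0, 4500.0, 4700.0, 4900.0, 5000.0],
    "weight_kg": [1900.0, 2300.0, 2500.0, 2800.0, 2900.0],
})

# The same lengths expressed in metres instead of millimetres
toy_m = toy.assign(length_m=toy["length_mm"] / 1000).drop(columns="length_mm")

# Variance of each numeric column
print(toy.var())

# Covariance and correlation of the same pair under the two scalings
cov_mm = toy.cov().loc["length_mm", "weight_kg"]
cov_m = toy_m.cov().loc["length_m", "weight_kg"]
corr_mm = toy.corr().loc["length_mm", "weight_kg"]
corr_m = toy_m.corr().loc["length_m", "weight_kg"]

print(cov_mm, cov_m)    # compare how the two covariances differ
print(corr_mm, corr_m)  # compare how the two correlations differ
```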