# Intro to Human-Centered Data Science
## Setting up GitHub and JupyterHub
If you're reading this, you successfully forked the Git repo for this assignment from GitHub. Congratulations!   
  
Now that you've done that, you can see the first Jupyter Notebook for the class. To continue the assignment, let's do a few things with the notebook that we'll often see in data science tasks. But first, let's take a look at the Notebook itself. One of many great things about Jupyter Notebooks is that we can combine text, programming code, and visualizations in the same file.

### Cells
Jupyter Notebooks contain *cells* where you can write code and text, display visualizations, etc. You put content into a cell and then run the cell to get an output. Let's take a closer look at how to enter and run Python code in a cell.   
   
The cell below contains a simple Python statement that adds two numbers and displays the result. Move your mouse over the cell and click into it. You can run the cell in two ways:
1. When you hover over or click into a cell in Colab, you'll see a triangle icon, the play button, appear on the left side. Click that, and the cell will run.
2. Press Shift+Enter on your keyboard after clicking on the cell.  

In [1]:
## Our first line of Python code. Run this cell by clicking on the run button
## in the top menu or pressing Shift+Enter anywhere in the cell.

25 + 75

100

You should see the number 100 if the cell ran successfully. Each cell will print the last returned value if there is one. For example, you won't see 100 in the code below because...well, let's see what happens.

In [2]:
my_sum = 25 + 75

What happened here is that we assigned the sum of 25 and 75 to a variable named `my_sum`. You'll make a lot of assignments like this.   

But assigning a value to a variable doesn't return anything. If we want to *see* the result when we run a cell, we can put the variable's name on the last line of the cell, because a variable name by itself evaluates to its value.


In [3]:
my_second_sum = 90 + 10
my_second_sum

100

We could also, at any time, use `print` to display text and variables.

In [4]:
print(25 + 75)
print(my_second_sum)

100
100


`print` is extremely useful. You can also combine it with other text to make longer statements. For example, let's print some explanatory text with the sums.

In [5]:
print("25 + 75 =", 25 + 75)
print(my_second_sum, 'is the sum of 90 and 10')

25 + 75 = 100
100 is the sum of 90 and 10


See how we can combine strings (things in between single or double quotes) with variables or other expressions.  
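
Another common way to mix text and values is an f-string, which evaluates whatever sits inside curly braces. The sketch below is an illustration rather than part of the original cells; the names `first`, `second`, `total`, and `message` are made up for the example.

```python
# Hypothetical values that mirror the sum used earlier in the notebook
first, second = 25, 75
total = first + second

# An f-string (note the leading f) fills in the expressions inside { }
message = f"{first} + {second} = {total}"
print(message)
```

Unlike `print` with commas, an f-string gives you full control over spacing and punctuation around each value.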

Here's a slightly more complicated example. You'll see three printed lines after running the next cell telling you if a sentence has a positive, negative, or neutral tone.

In [6]:
import string

# here's a very simple way to do sentiment analysis
def sentiment_analysis(text):
  # List of positive words
  positive_words = ["good", "great", "excellent", "love", "happy"]

  # List of negative words
  negative_words = ["bad", "terrible", "hate", "sad", "unhappy"]

  # remove punctuation
  text = text.translate(str.maketrans('', '', string.punctuation))

  # Make the text all lowercase. Then split into individual words
  words = text.lower().split()

  # count the number of positive and negative words
  positive_count = sum([1 for word in words if word in positive_words])
  negative_count = sum([1 for word in words if word in negative_words])

  # If more positive words than negative words, return 'positive'
  if positive_count > negative_count:
      return 'positive'
  # If more negative words than positive words, return 'negative'
  elif positive_count < negative_count:
      return 'negative'
  # else return 'neutral'
  else:
      return 'neutral'

# Test the function
print("'I had a great time!' is", sentiment_analysis("I had a great time!"))  # Output: positive
print("'I was unhappy.' is", sentiment_analysis("I was unhappy."))  # Output: negative
print("'I feel blah today.' is", sentiment_analysis("I feel blah today."))  # Output: neutral


'I had a great time!' is positive
'I was unhappy.' is negative
'I feel blah today.' is neutral


That's a *very* simplified version of *sentiment analysis*, determining whether a sentence or phrase is positive, negative, or neutral. It's a useful technique in text analysis, and you'll certainly see more realistic implementations of it later in the program.  
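
To see why this word-counting approach is only a starting point, it can help to look at the raw counts instead of the final label. The following is a small sketch under the same assumptions as the cell above (same word lists, same punctuation stripping); the function name `sentiment_counts` and the test sentences are hypothetical.

```python
import string

# Same word lists as the cell above, stored as sets for quick membership checks
POSITIVE_WORDS = {"good", "great", "excellent", "love", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "sad", "unhappy"}

def sentiment_counts(text):
    """Return (positive_count, negative_count) for a piece of text."""
    # Remove punctuation, lowercase the text, and split it into words
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    words = cleaned.lower().split()
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return positive, negative

mixed = sentiment_counts("I love the food, but the service was terrible.")
negated = sentiment_counts("I am not happy.")
print(mixed)
print(negated)
```

The first sentence has one positive and one negative word, so the simple approach would call it neutral. The second counts "happy" as positive even though "not" reverses its meaning, which is the kind of case more realistic methods have to handle.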

The important thing for now is that you can run Python code and see results in Jupyter Notebook cells.

### Markdown
The text you're reading now is in a format called [Markdown](https://www.markdownguide.org). Markdown is a simple language that lets you add formatting to plain text documents. You're used to formatting text by clicking on words or phrases and then selecting a format, e.g., bold, italic, Header, etc. In Markdown, you don't apply formats this way. Instead, you add special codes to the text to specify the way it should appear.

For example, the first line in this document is a heading. To make it appear large, we add a number sign before it (e.g., `# Intro to Human-Centered Data Science`). You can make text _italicized_ by adding an asterisk or underscore before and after it (e.g.,  `_Jupyter Notebooks are great!_`). Want text in **bold**? Add two asterisks before and after the text (e.g., `**my bold text**`).  

A Jupyter Notebook cell can be set to edit and display text with Markdown formats. It's a useful way to add documentation to your project with all the benefits of formatted text. [Here's](https://www.markdownguide.org/cheat-sheet/) a useful cheatsheet that describes all the Markdown syntax. Let's practice using some of the codes you'll use often.    


Using the [cheat sheet](https://www.markdownguide.org/cheat-sheet/) as a guide, write a short piece of text introducing yourself below adding the following formatting in Markdown:
1. Include a *heading* with your name.
2. Tell us where you did your undergraduate degree. Format the university in **bold** and *italicize* your major.
3. Tell us why you decided to study data science. Highlight your response using a *blockquote*.
4. Name three of your hobbies using a *numbered* list.
5. Provide the name of a web site that you often visit and include a *hyperlink* to that site.

### ENTER YOUR TEXT WITH MARKDOWN BELOW  
To edit a Markdown cell, you need to double-click it.  

When you're done entering your text, hit Shift-Enter to run the cell and see the formatted text!


### Importing Python libraries
We learned earlier that the Python programming language has many libraries that provide useful tools for free. We use a Python command called `import` to include a library in a Notebook. For example, the statement `import math` will make Python's math library available in our Notebook. The library lives in what is called a _module_. Think of it as a container filled with lots of math functions and constants. We use what's called _dot notation_ to access the functions inside the module. Look at the code below for an example:


In [7]:
import math
print(math)

<module 'math' (built-in)>


We used `import` to bring the math library into our Notebook. When we print math, we see its type identifier. Don't worry too much about this except to note that it is a module. Now let's access the content `pi` within the module.

In [8]:
print(math.pi)

3.141592653589793


There's the dot notation. We write `math.pi` to access the constant pi (approximately 3.14159). What happens if we try to ask for pi _without_ the math module?

In [9]:
print(pi)

NameError: name 'pi' is not defined

Look at all of that text above: That's Python's way of signaling an error. As you can guess, 'pi' is undefined. But `math.pi` *is* defined and ours to use once we import math. In fact, we can call lots of math functions using dot notation:

In [10]:
print("The factorial of 5 is =", math.factorial(5))
print("90 degrees in radians =", math.radians(90))
print("The sine of 90 degrees =", math.sin(math.radians(90)))
print("We can represent non-numbers with", math.nan)

The factorial of 5 is = 120
90 degrees in radians = 1.5707963267948966
The sine of 90 degrees = 1.0
We can represent non-numbers with nan


Don't worry too much about the actual math. The important thing to note here is we import libraries as modules, and we access things inside of them with `module_name.function_or_constant_name` (e.g., `math.inf` returns a constant for positive infinity).  
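
Here is one more short sketch of dot notation, using two other functions from the standard `math` module. The variable names `root` and `rounded_down` are made up for illustration.

```python
import math

# Functions and constants are both reached through the module name
root = math.sqrt(16)                 # square root, returned as a float
rounded_down = math.floor(math.pi)   # largest whole number not above pi

print(root, rounded_down)

# hasattr checks whether a name exists inside the module
print(hasattr(math, "pi"), hasattr(math, "not_a_real_name"))
```

If you are ever unsure whether a module contains something, `hasattr` (or `dir(math)`, which lists every name in the module) is a quick way to check.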

Sometimes we only want to import a few functions or constants from a library. As an example, imagine we only need the constant `pi` and the function `pow` from the math library. We can do the following to just get those two:

In [11]:
from math import pi, pow
print(pi)
print("2 to the 5th power =", pow(2,5))

3.141592653589793
2 to the 5th power = 32.0
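
One detail in the output above is worth a note: `math.pow` always returns a float, which is why we see `32.0` rather than `32`. Importing `pow` this way also replaces Python's built-in `pow` under that name for the rest of the session. The sketch below compares `math.pow` with the `**` operator; the variable names are hypothetical.

```python
import math

float_result = math.pow(2, 5)   # math.pow converts its inputs to floats
int_result = 2 ** 5             # the ** operator keeps whole numbers whole

print(float_result, type(float_result))
print(int_result, type(int_result))
```

If you need an integer result, the `**` operator is usually the simpler choice.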


We can also import a library and give it an alias. For example, we will use a library named `pandas` **a lot** in this and other courses. You often see the pandas module given the nickname or alias `pd`. Here's how to do that:

In [12]:
import pandas as pd
print(type(pd))

<class 'module'>


You can see that we import pandas *as* `pd`. Then we can use dot notation on `pd` to use functions in the pandas library. We printed the type of `pd` to make sure it's a module.

Now let's use pandas (`pd`) to do something. How about starting by making a basic column of numerical data.

In [13]:
my_series = pd.Series([1,3, 5, 7, 9])
my_series

Unnamed: 0,0
0,1
1,3
2,5
3,7
4,9


We called a function named `Series` to make a one-dimensional array of numbers. Think of a `Series` as a single column or row in a data table.
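
A `Series` comes with its own functions that we reach with dot notation, just like a module. The sketch below shows a few of them on the same five numbers; the variable names are made up for illustration, and it assumes pandas is installed.

```python
import pandas as pd

# The same five numbers as above, with an optional name for the column
odd_numbers = pd.Series([1, 3, 5, 7, 9], name="odd_numbers")

total = odd_numbers.sum()           # add up every value
average = odd_numbers.mean()        # arithmetic mean
third_value = odd_numbers.iloc[2]   # value at position 2 (counting from 0)
doubled = odd_numbers * 2           # arithmetic applies to every element

print(total, average, third_value)
print(doubled.tolist())
```

Notice that multiplying the `Series` by 2 doubles every value at once, with no loop required.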

### Loading data from a file into a `pandas` `DataFrame`

We'll often have data stored in a file or database that we want to load into a Notebook. pandas has a set of [input/output functions](https://pandas.pydata.org/docs/user_guide/io.html) that let us load and store data from HTML, comma-separated values (CSV), Excel, SQL, and other files. Let's try a simple example with a CSV file.  

We're going to load a dataset created by [Sling Academy](https://www.slingacademy.com) containing data about high school students. Well, fake high school students. You can learn more about the content of the data from the Sling Academy [overview](https://www.slingacademy.com/article/student-scores-sample-data-csv-json-xlsx-xml/#Overview).
  
We'll use a pandas function, `read_csv`, to read the file. First, we'll specify the location of the file in a variable named `student_scores_URL`. Then we load from that URL with `read_csv`.

In [14]:
student_scores_URL = "https://raw.githubusercontent.com/bsmith-classroom/Github-and-Jupyter-setup/main/student-scores.csv"

df = pd.read_csv(student_scores_URL)


Here's what we just did:
1. We called `pd.read_csv` with the URL of our data file, student-scores.csv.
2. `read_csv` loads the file and converts it to a pandas `DataFrame`. A `DataFrame` is like an Excel sheet, a bunch of data elements in rows with different attributes or features in the columns. We assign the new `DataFrame` to a variable called `df`.
3. Now that `df` is defined as a `DataFrame`, we can use dot notation to run `DataFrame` functions on it. One of those is called `head` which returns the first five rows of the data table.
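
The steps above can also be tried without a network connection. The sketch below is a hypothetical stand-in for the real dataset: it feeds `read_csv` a few lines of made-up CSV text through `io.StringIO`, which makes a string behave like a file. The names and scores are invented for illustration.

```python
import io

import pandas as pd

# A tiny, made-up CSV: the first line holds the column names
csv_text = """id,first_name,math_score
1,Ada,91
2,Grace,88
3,Alan,79
"""

# io.StringIO lets read_csv treat the string as if it were a file
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)
print(small_df.head(2))
```

`read_csv` accepts a URL, a local file path, or a file-like object, so the same call works for all three.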

`.head()` is used... a lot. Let's run it now to see what we get.

In [15]:
df.head()

Unnamed: 0,id,first_name,last_name,email,gender,part_time_job,absence_days,extracurricular_activities,weekly_self_study_hours,career_aspiration,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score
0,1,Paul,Casey,paul.casey.1@gslingacademy.com,male,False,3,False,27,Lawyer,73,81,93,97,63,80,87
1,2,Danielle,Sandoval,danielle.sandoval.2@gslingacademy.com,female,False,2,False,47,Doctor,90,86,96,100,90,88,90
2,3,Tina,Andrews,tina.andrews.3@gslingacademy.com,female,False,9,True,13,Government Officer,81,97,95,96,65,77,94
3,4,Tara,Clark,tara.clark.4@gslingacademy.com,female,False,5,False,3,Artist,71,74,88,80,89,63,86
4,5,Anthony,Campos,anthony.campos.5@gslingacademy.com,male,False,5,False,10,Unknown,84,77,65,65,80,74,76


You see that the columns of the table are named things like `first_name`, `last_name`, `gender`, etc. This is a table of (fake) student records, so you'll also see things like `absence_days`, `math_score`, `physics_score`, etc. You can imagine a scenario where these data might be used to understand patterns or relations between variables (e.g., are there correlations between math and physics scores?). We'll do more of that kind of analysis in the next module.  

Here are some other useful `pandas` functions that we'll use a lot.

In [16]:
df.tail()

Unnamed: 0,id,first_name,last_name,email,gender,part_time_job,absence_days,extracurricular_activities,weekly_self_study_hours,career_aspiration,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score
1995,1996,Alan,Reynolds,alan.reynolds.1996@gslingacademy.com,male,False,2,False,30,Construction Engineer,83,77,84,73,75,84,82
1996,1997,Thomas,Gilbert,thomas.gilbert.1997@gslingacademy.com,male,False,2,False,20,Software Engineer,89,65,73,80,87,67,73
1997,1998,Madison,Cross,madison.cross.1998@gslingacademy.com,female,False,5,False,14,Software Engineer,97,85,63,93,68,94,78
1998,1999,Brittany,Compton,brittany.compton.1999@gslingacademy.com,female,True,10,True,5,Business Owner,51,96,72,89,95,88,75
1999,2000,Natalie,Smith,natalie.smith.2000@gslingacademy.com,female,False,5,False,27,Accountant,82,99,91,69,83,93,100


We should obviously be able to look at the tail of a `DataFrame` if we can look at the head. `.tail()` gives you the last five elements in a `DataFrame`.  

By the way, you can always ask for more than five elements...just put the desired number in the function call. For example, this will give us the last 12 rows of the data:

In [17]:
df.tail(12)

Unnamed: 0,id,first_name,last_name,email,gender,part_time_job,absence_days,extracurricular_activities,weekly_self_study_hours,career_aspiration,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score
1988,1989,Charles,Miller,charles.miller.1989@gslingacademy.com,male,False,2,False,3,Unknown,66,63,95,76,71,99,73
1989,1990,Samuel,Baker,samuel.baker.1990@gslingacademy.com,male,False,2,False,17,Stock Investor,77,86,97,85,77,88,93
1990,1991,Anthony,Moore,anthony.moore.1991@gslingacademy.com,male,False,0,False,17,Software Engineer,98,63,64,82,64,96,79
1991,1992,Charlotte,Rowe,charlotte.rowe.1992@gslingacademy.com,female,False,4,False,27,Lawyer,88,96,79,70,95,90,99
1992,1993,John,Peterson,john.peterson.1993@gslingacademy.com,male,False,2,True,12,Banker,79,83,76,99,100,84,71
1993,1994,Shawn,Ochoa,shawn.ochoa.1994@gslingacademy.com,male,False,3,False,46,Doctor,92,92,91,95,88,94,93
1994,1995,Steven,Lewis,steven.lewis.1995@gslingacademy.com,male,False,1,False,19,Accountant,76,62,90,82,93,71,61
1995,1996,Alan,Reynolds,alan.reynolds.1996@gslingacademy.com,male,False,2,False,30,Construction Engineer,83,77,84,73,75,84,82
1996,1997,Thomas,Gilbert,thomas.gilbert.1997@gslingacademy.com,male,False,2,False,20,Software Engineer,89,65,73,80,87,67,73
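1997,1998,Madison,Cross,madison.cross.1998@gslingacademy.com,female,False,5,False,14,Software Engineer,97,85,63,93,68,94,78
1998,1999,Brittany,Compton,brittany.compton.1999@gslingacademy.com,female,True,10,True,5,Business Owner,51,96,72,89,95,88,75
1999,2000,Natalie,Smith,natalie.smith.2000@gslingacademy.com,female,False,5,False,27,Accountant,82,99,91,69,83,93,100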


And here are the first 9 rows.

In [18]:
df.head(9)

Unnamed: 0,id,first_name,last_name,email,gender,part_time_job,absence_days,extracurricular_activities,weekly_self_study_hours,career_aspiration,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score
0,1,Paul,Casey,paul.casey.1@gslingacademy.com,male,False,3,False,27,Lawyer,73,81,93,97,63,80,87
1,2,Danielle,Sandoval,danielle.sandoval.2@gslingacademy.com,female,False,2,False,47,Doctor,90,86,96,100,90,88,90
2,3,Tina,Andrews,tina.andrews.3@gslingacademy.com,female,False,9,True,13,Government Officer,81,97,95,96,65,77,94
3,4,Tara,Clark,tara.clark.4@gslingacademy.com,female,False,5,False,3,Artist,71,74,88,80,89,63,86
4,5,Anthony,Campos,anthony.campos.5@gslingacademy.com,male,False,5,False,10,Unknown,84,77,65,65,80,74,76
5,6,Kelly,Wade,kelly.wade.6@gslingacademy.com,female,False,2,False,26,Unknown,93,100,67,78,72,80,84
6,7,Anthony,Smith,anthony.smith.7@gslingacademy.com,male,False,3,True,23,Software Engineer,99,96,97,73,88,76,64
7,8,George,Short,george.short.8@gslingacademy.com,male,True,2,True,34,Software Engineer,95,95,82,63,84,70,85
8,9,Stanley,Gutierrez,stanley.gutierrez.9@gslingacademy.com,male,False,6,False,25,Unknown,94,68,94,85,81,74,72


`.info()` gives a nice summary of the `DataFrame`.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   id                          2000 non-null   int64 
 1   first_name                  2000 non-null   object
 2   last_name                   2000 non-null   object
 3   email                       2000 non-null   object
 4   gender                      2000 non-null   object
 5   part_time_job               2000 non-null   bool  
 6   absence_days                2000 non-null   int64 
 7   extracurricular_activities  2000 non-null   bool  
 8   weekly_self_study_hours     2000 non-null   int64 
 9   career_aspiration           2000 non-null   object
 10  math_score                  2000 non-null   int64 
 11  history_score               2000 non-null   int64 
 12  physics_score               2000 non-null   int64 
 13  chemistry_score             2000 non-null   int64 

`.info()` gives a few useful pieces of information. First, it has a `RangeIndex` that tells us how many rows are in the `DataFrame`. In this case, we have 2000 rows of data. We can also see the column names in the `DataFrame`. For each column, we see the number of *non-null* values, that is, how many values in each column are present rather than missing. Thankfully, we don't have any missing values in this dataset. Finally, we can see the `dtype` or data type for each column. Some of these are `int64` for integer, `bool` for boolean (e.g., True or False) or just generic `object`...those are often text strings. `.info()` tells us a lot about the types of data we have and if we need to deal with missing values.
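
Our dataset has no missing values, so here is a sketch of what a missing value looks like in a tiny, made-up `DataFrame`. The names `toy_df` and `missing_per_column` and the data are hypothetical; `isna` and `dtypes` are standard pandas features.

```python
import pandas as pd

# A made-up table where one math score is missing (None)
toy_df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "math_score": [91, None, 79],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# The missing value forces math_score to be stored as floating point
print(toy_df.dtypes)
```

Counting missing values per column this way gives the same information as comparing the non-null counts in `.info()` against the total number of rows.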

In [20]:
df.describe()

Unnamed: 0,id,absence_days,weekly_self_study_hours,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,1000.5,3.6655,17.7555,83.452,80.332,81.3365,79.995,79.5815,81.2775,80.888
std,577.494589,2.629271,12.129604,13.224906,12.736046,12.539453,12.777895,13.72219,12.027087,11.637705
min,1.0,0.0,0.0,40.0,50.0,50.0,50.0,30.0,50.0,60.0
25%,500.75,2.0,5.0,77.0,69.75,71.0,69.0,69.0,72.0,71.0
50%,1000.5,3.0,18.0,87.0,82.0,83.0,81.0,81.0,83.0,81.0
75%,1500.25,5.0,28.0,93.0,91.0,92.0,91.0,91.0,91.0,91.0
max,2000.0,10.0,50.0,100.0,100.0,100.0,100.0,100.0,99.0,100.0


`.describe()` is another function you'll use in almost every Notebook you create. It lets us see summary statistics for all of the numerical data in a `DataFrame`. For instance, focus on the column named `math_score`. You can see the mean `math_score` is 83.452, the standard deviation is about 13.22, etc. You can also see the minimum and maximum values in each column as well as the quartile values (25%, 50%, 75%). `.describe()` is useful when we start doing exploratory data analysis to get a high-level picture of our data.  
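
Each number in the `.describe()` table can also be computed on its own. The sketch below uses four made-up scores so the arithmetic is easy to check by hand; the names `scores`, `summary`, `mean_score`, and `std_score` are hypothetical.

```python
import pandas as pd

# Four made-up scores that are easy to verify by hand
scores = pd.DataFrame({"math_score": [70, 80, 90, 100]})

summary = scores["math_score"].describe()   # the full summary for one column
mean_score = scores["math_score"].mean()    # (70 + 80 + 90 + 100) / 4
std_score = scores["math_score"].std()      # sample standard deviation

print(summary)
print(mean_score, round(std_score, 2))
```

pandas reports the *sample* standard deviation by default (it divides by n - 1 rather than n), which is why the value may differ slightly from what some calculators give.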

You can also flip the table from columns to rows using `.T`, the pandas transpose attribute:

In [21]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,2000.0,1000.5,577.494589,1.0,500.75,1000.5,1500.25,2000.0
absence_days,2000.0,3.6655,2.629271,0.0,2.0,3.0,5.0,10.0
weekly_self_study_hours,2000.0,17.7555,12.129604,0.0,5.0,18.0,28.0,50.0
math_score,2000.0,83.452,13.224906,40.0,77.0,87.0,93.0,100.0
history_score,2000.0,80.332,12.736046,50.0,69.75,82.0,91.0,100.0
physics_score,2000.0,81.3365,12.539453,50.0,71.0,83.0,92.0,100.0
chemistry_score,2000.0,79.995,12.777895,50.0,69.0,81.0,91.0,100.0
biology_score,2000.0,79.5815,13.72219,30.0,69.0,81.0,91.0,100.0
english_score,2000.0,81.2775,12.027087,50.0,72.0,83.0,91.0,99.0
geography_score,2000.0,80.888,11.637705,60.0,71.0,81.0,91.0,100.0


Sometimes we need to know the "shape" of the data, how many columns and rows are in the `DataFrame`.

In [22]:
df.shape

(2000, 17)

This tells us there are 2000 rows and 17 columns in the dataset. That seems like a lot...`.size` can tell us just how many data points we have (number of rows multiplied by number of columns).

In [23]:
df.size

34000
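
As a quick check of how `.shape` and `.size` relate, here is a sketch on a small, made-up table. The names `grid`, `rows`, and `cols` are hypothetical.

```python
import pandas as pd

# A made-up table with 3 rows and 2 columns
grid = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = grid.shape   # shape is a (rows, columns) pair
print(rows, cols)
print(grid.size)          # total number of cells
print(len(grid))          # len() of a DataFrame is its number of rows
```

Both `.shape` and `.size` are attributes rather than functions, so they are written without parentheses.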

We'll occasionally want to know how many unique values we have in a dataset. In a student dataset like this one, we should expect every `id` and `email` address to be unique. Everything else...well, let's use `.nunique()` to find out.

In [24]:
df.nunique()

Unnamed: 0,0
id,2000
first_name,453
last_name,707
email,2000
gender,2
part_time_job,2
absence_days,11
extracurricular_activities,2
weekly_self_study_hours,50
career_aspiration,17


You should see some interesting things from this. 2000 unique email addresses...makes sense. Two unique genders...this suggests we may want to ask additional questions during data collection to capture additional gender identity. In general, `.nunique()` can help us learn about categorical features like gender, especially when we don't know how they were collected.  
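
Once we know *how many* unique values a column has, a natural next question is *which* values appear and how often. The pandas function `value_counts` answers that. The sketch below uses a made-up column; the name `toy` and the data are hypothetical.

```python
import pandas as pd

# A made-up categorical column
toy = pd.DataFrame({
    "career_aspiration": ["Lawyer", "Doctor", "Lawyer", "Unknown", "Lawyer"],
})

n_unique = toy["career_aspiration"].nunique()      # number of distinct values
counts = toy["career_aspiration"].value_counts()   # how often each one appears

print(n_unique)
print(counts)
```

`value_counts` sorts its result from most to least frequent, so the most common category appears first.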

There are also times when we want to know the column names, especially when there are a lot of them. `.columns` (an attribute, so no parentheses) will give us what we need.

In [26]:
df.columns

Index(['id', 'first_name', 'last_name', 'email', 'gender', 'part_time_job',
       'absence_days', 'extracurricular_activities', 'weekly_self_study_hours',
       'career_aspiration', 'math_score', 'history_score', 'physics_score',
       'chemistry_score', 'biology_score', 'english_score', 'geography_score'],
      dtype='object')

That's helpful. We can make that a Python list, if needed, with the following call.

In [27]:
list(df.columns)

['id',
 'first_name',
 'last_name',
 'email',
 'gender',
 'part_time_job',
 'absence_days',
 'extracurricular_activities',
 'weekly_self_study_hours',
 'career_aspiration',
 'math_score',
 'history_score',
 'physics_score',
 'chemistry_score',
 'biology_score',
 'english_score',
 'geography_score']
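
One reason a plain list of column names is handy is that we can filter it with ordinary Python. As a sketch, the code below picks out every column whose name ends in `_score`; the empty `DataFrame` named `toy` is a hypothetical stand-in with only a few of the real column names.

```python
import pandas as pd

# A made-up, empty table that only defines column names
toy = pd.DataFrame(columns=["id", "first_name", "math_score", "history_score"])

# Keep only the names that end with "_score"
score_columns = [name for name in toy.columns if name.endswith("_score")]
print(score_columns)
```

A list like `score_columns` can then be used to select just those columns from the table.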

Finally, we may want to know the smallest or largest values in a column. `nsmallest` and `nlargest` will give us those values. Here are examples. First, let's find the five students with the highest math scores.

In [28]:
# find the 5 students with the highest math scores
df.nlargest(5, 'math_score')

Unnamed: 0,id,first_name,last_name,email,gender,part_time_job,absence_days,extracurricular_activities,weekly_self_study_hours,career_aspiration,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score
20,21,Tim,Nichols,tim.nichols.21@gslingacademy.com,male,True,3,False,15,Software Engineer,100,90,72,98,73,97,72
26,27,Jason,Williams,jason.williams.27@gslingacademy.com,male,False,3,False,34,Banker,100,77,80,94,63,90,90
49,50,Sonia,Noble,sonia.noble.50@gslingacademy.com,female,False,0,False,14,Accountant,100,89,90,93,30,83,74
95,96,Victoria,Jones,victoria.jones.96@gslingacademy.com,female,False,2,True,34,Software Engineer,100,98,89,85,81,64,94
98,99,Derrick,Figueroa,derrick.figueroa.99@gslingacademy.com,male,True,0,False,28,Lawyer,100,93,85,61,96,83,86


And then let's find the 7 students with the lowest geography scores.

In [29]:
df.nsmallest(7, 'geography_score')

Unnamed: 0,id,first_name,last_name,email,gender,part_time_job,absence_days,extracurricular_activities,weekly_self_study_hours,career_aspiration,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score
150,151,Courtney,Perry,courtney.perry.151@gslingacademy.com,female,False,3,False,3,Unknown,86,92,79,73,73,61,60
190,191,Penny,Perez,penny.perez.191@gslingacademy.com,female,False,7,False,1,Game Developer,99,82,93,83,92,90,60
269,270,Melissa,Marshall,melissa.marshall.270@gslingacademy.com,female,False,2,False,23,Software Engineer,98,68,80,75,80,81,60
303,304,Amanda,Davis,amanda.davis.304@gslingacademy.com,female,False,1,True,12,Government Officer,65,69,66,65,90,76,60
304,305,Gina,Powell,gina.powell.305@gslingacademy.com,female,True,10,False,4,Business Owner,82,55,76,63,70,61,60
328,329,Matthew,Anderson,matthew.anderson.329@gslingacademy.com,male,False,2,False,27,Lawyer,70,88,60,68,66,82,60
330,331,Sydney,Johnson,sydney.johnson.331@gslingacademy.com,female,False,1,False,25,Software Engineer,91,65,86,66,88,85,60
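
For intuition, `nlargest` gives the same rows you would get by sorting a column from high to low and keeping the top few. The sketch below shows both approaches on a made-up table with no tied scores; the names `toy`, `top_two`, and `sorted_top_two` are hypothetical.

```python
import pandas as pd

# A made-up table with four students and no tied scores
toy = pd.DataFrame({
    "first_name": ["Ada", "Grace", "Alan", "Edsger"],
    "math_score": [91, 88, 79, 95],
})

# Two ways to get the two highest math scores
top_two = toy.nlargest(2, "math_score")
sorted_top_two = toy.sort_values("math_score", ascending=False).head(2)

print(top_two)
print(sorted_top_two)
```

When several rows share the same score, as in the outputs above, the two approaches may order the tied rows differently.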


Notice we used the name of a column to get specific information. You'll do this a lot with any dataset. The most common way you'll see this is to include the column you want in brackets ([]) after the dataset name. For example, we can look at all the values in the `career_aspiration` column with the following command.

In [30]:
df['career_aspiration']

Unnamed: 0,career_aspiration
0,Lawyer
1,Doctor
2,Government Officer
3,Artist
4,Unknown
...,...
1995,Construction Engineer
1996,Software Engineer
1997,Software Engineer
1998,Business Owner


You can also see multiple columns by putting them in a Python list.

In [31]:
df[['career_aspiration', 'chemistry_score', 'math_score']]

Unnamed: 0,career_aspiration,chemistry_score,math_score
0,Lawyer,97,73
1,Doctor,100,90
2,Government Officer,96,81
3,Artist,80,71
4,Unknown,65,84
...,...,...,...
1995,Construction Engineer,73,83
1996,Software Engineer,80,89
1997,Software Engineer,93,97
1998,Business Owner,89,51


You'll see and use this notation, `dataframe_name[dataframe_column]`, again and again.
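
One thing to keep in mind is that the two forms return different kinds of objects: a single column name gives back a `Series`, while a list of names gives back a smaller `DataFrame`. The sketch below checks this on a made-up two-row table; the names `toy`, `one_column`, and `two_columns` are hypothetical.

```python
import pandas as pd

# A made-up table with two rows
toy = pd.DataFrame({
    "career_aspiration": ["Lawyer", "Doctor"],
    "math_score": [73, 90],
    "chemistry_score": [97, 100],
})

one_column = toy["math_score"]                           # a Series
two_columns = toy[["career_aspiration", "math_score"]]   # a DataFrame

print(type(one_column))
print(type(two_columns))
print(one_column.max())
```

Because `one_column` is a `Series`, functions such as `.max()` and `.mean()` work on it directly.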

This is a good place to stop. Now return to the [tool assignment](https://docs.google.com/document/d/16imAtemYNYIN9DfgCvOc7bHflKjIAAGqrh9mAES59KA/edit#heading=h.dnvb8051io70) document for the final steps of the assignment.

