## **Overview**
- This notebook will guide you through the process of using Google Colab and Python to create a subset of astronaut data that compares occupation counts by gender.

- The original dataset includes detailed information about astronauts, including their name, sex, occupation, and mission details. This subset focuses specifically on how different occupations are represented among male and female astronauts.

- By the end, you will have a table that shows how often each occupation appears for each gender, saved as a separate CSV file.

- These instructions are written for users with minimal Python experience and cover only basic filtering, counting, and merging methods using pandas.

## **Getting Started**
### 1. Set Up Your Project in Google Colab  
- Open [Google Colab](https://colab.research.google.com/)
- Create a new notebook by selecting **+ New notebook** or **File > New notebook**

### 2. Download the Original `astronauts.csv` File  
- The original file can be found on the homepage of this repository
- Once downloaded, go to your Google Drive and open your **Colab Notebooks** folder
- Once inside the folder, click on the **+ New** button in the upper left-hand corner
- Select **File upload** and upload the `astronauts.csv` file you downloaded

### 3. Connect Google Colab to Google Drive  
- After uploading the `astronauts.csv` file, go back to your Google Colab notebook.
- Mount your Google Drive to allow the notebook to access the uploaded file. Run the following code to do so:



In [174]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


- **_To run the code_**, hover over the code cell and a play button icon ▶ will appear to its left. Click this button to run the cell. If it ran successfully, a green check ✔ will appear next to the cell

### 4. Import Packages
- To work with the data, we need to use some helpful tools, called **packages**, that allow us to manipulate and analyze the data easily

  - The first tool we will use is **NumPy**. This tool helps with performing mathematical operations on data

  - The second tool we will use is **Pandas**. This tool helps us organize and manage our data in a table format, making it easier to work with. We will import both of these packages using the following code:



In [175]:
import numpy as np
import pandas as pd

- By importing these packages and giving them nicknames (`np` for NumPy and `pd` for Pandas), we can use them more easily throughout our code. For example, instead of typing `pandas.<function>`, we can just type `pd.<function>`, which saves typing and keeps the code shorter

### 5. Read the Data
- Now that we have the necessary packages imported, it's time to load our data into the notebook so we can work with it. Here's the code you'll need to run:

In [176]:
df = pd.read_csv('gdrive/My Drive/Colab Notebooks/astronauts.csv')
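
Before filtering, it can help to confirm that the file loaded as expected. The sketch below is only an illustration: `sample_csv` and `sample_df` are made-up stand-ins (not the real `astronauts.csv`) so the example runs anywhere. It previews the first rows and checks that the two columns used later in this notebook, `sex` and `occupation`, are present.

```python
import io
import pandas as pd

# Hypothetical stand-in for astronauts.csv so this example runs anywhere.
sample_csv = io.StringIO(
    "name,sex,occupation\n"
    "Astronaut A,male,commander\n"
    "Astronaut B,female,MSP\n"
    "Astronaut C,male,pilot\n"
)
sample_df = pd.read_csv(sample_csv)

# Preview the first rows and the table size as (rows, columns).
print(sample_df.head())
print(sample_df.shape)

# Confirm that the columns this notebook relies on are present.
required_columns = ["sex", "occupation"]
missing_columns = [col for col in required_columns if col not in sample_df.columns]
print("Missing columns:", missing_columns)
```

With the real data, the same checks can be run on `df` instead of `sample_df`.
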

## **Creating the Subsets**
### 1. Filter the Data for Male Astronauts
- We start by creating a new subset of the data that only includes male astronauts. This is done using the following code:

In [177]:
male_astronauts = df[df['sex'] == 'male'].copy()

- This line filters the original dataset/dataframe (`df`) to only include rows where the `sex` column is equal to `'male'`
- The `.copy()` method is used to ensure that we create a copy of the filtered data, so we don’t accidentally modify the original dataset
- This new filtered set of data is named `male_astronauts`, but you can name it whatever you want. Choose a name that is easy to remember and describes the data
- _Going forward, just remember_: whatever you put on the **_left side of the_** **`=`** is the name (or reference) you're assigning to the result of the code on the right side. You'll use that name later to refer back to the data you just created
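
The filtering line does two things at once, and it may be easier to follow when split into steps. The sketch below uses a small made-up table (`toy_df`, `is_male`, and `toy_males` are illustrative names, not part of the dataset) to show the True/False comparison, the row selection, and the effect of `.copy()`.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from astronauts.csv.
toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "sex": ["male", "female", "male", "female"],
    "occupation": ["pilot", "MSP", "commander", "MSP"],
})

# Step 1: the comparison produces one True/False value per row.
is_male = toy_df["sex"] == "male"
print(is_male.tolist())

# Step 2: indexing with that mask keeps only the rows marked True.
toy_males = toy_df[is_male].copy()
print(len(toy_males))

# The subset is an independent copy, so editing it leaves toy_df unchanged.
toy_males["occupation"] = "edited"
print(toy_df["occupation"].tolist())
```
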

### 2. Filter the Data for Female Astronauts
- Next, we will create a subset for female astronauts using the same method:

In [178]:
female_astronauts = df[df['sex'] == 'female'].copy()


- This line filters the dataset to only include rows where the `sex` column is equal to `'female'`

### 3. Count the Occupations of Male Astronauts
- Now that we have the subset for male astronauts, we will count how many times each occupation appears in it. This can be done using the `.value_counts()` method:

In [179]:
male_occupation_counts = male_astronauts['occupation'].value_counts()

- This counts the number of occurrences of each occupation in the occupation column of the `male_astronauts` subset
- We can double-check and view our subset by running the name of the subset in a line of code:



In [180]:
male_occupation_counts

| occupation | count |
|---|---|
| MSP | 392 |
| commander | 312 |
| pilot | 190 |
| flight engineer | 176 |
| PSP | 54 |
| Other (space tourist) | 8 |
| Other (Journalist) | 1 |
| spaceflight participant | 1 |
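
To see what `.value_counts()` does on a smaller scale, here is a minimal sketch on made-up values (`toy_occupations` is an illustrative name). It also shows one detail worth knowing: by default, missing entries are left out of the tally unless `dropna=False` is passed.

```python
import pandas as pd

# Made-up occupations for illustration; None stands for a missing entry.
toy_occupations = pd.Series(
    ["pilot", "MSP", "MSP", "commander", "MSP", None],
    name="occupation",
)

# Counts are sorted from most to least frequent; missing values are skipped.
toy_counts = toy_occupations.value_counts()
print(toy_counts)

# Pass dropna=False to include missing values in the tally.
toy_counts_with_missing = toy_occupations.value_counts(dropna=False)
print(toy_counts_with_missing)
```
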


### 4. Count the Occupations of Female Astronauts
- We can repeat the same process for female astronauts:

In [181]:
female_occupation_counts = female_astronauts['occupation'].value_counts()

- This counts the number of occurrences of each occupation in the occupation column of the `female_astronauts` subset
- Again, we can double-check and view our subset by running the name of the subset in a line of code:




In [182]:
female_occupation_counts

| occupation | count |
|---|---|
| MSP | 106 |
| flight engineer | 20 |
| pilot | 7 |
| PSP | 5 |
| commander | 3 |
| Other (space tourist) | 2 |
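
An optional sanity check at this point, which is not part of the original steps, is to confirm that the counts add up to the number of rows in each subset. The sketch below shows the idea on made-up data; all `toy_` names are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "sex": ["male", "female", "male", "male", "female"],
    "occupation": ["pilot", "MSP", "commander", "pilot", "MSP"],
})

toy_male = toy_df[toy_df["sex"] == "male"].copy()
toy_female = toy_df[toy_df["sex"] == "female"].copy()

toy_male_counts = toy_male["occupation"].value_counts()
toy_female_counts = toy_female["occupation"].value_counts()

# Each subset's counts should add up to its number of rows
# (as long as no occupation values are missing).
male_ok = toy_male_counts.sum() == len(toy_male)
female_ok = toy_female_counts.sum() == len(toy_female)
print(male_ok, female_ok)

# Together the two subsets should cover every row of the original table;
# a shortfall would point to other or misspelled values in the sex column.
all_rows_covered = len(toy_male) + len(toy_female) == len(toy_df)
print(all_rows_covered)
```
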


## **Merging the Subsets**
### 1. Combine the Data into One Dataset
- At this point, we have two sets of occupation counts: one for male astronauts and one for female astronauts. To compare them side by side, we will merge these counts into one table using the `pd.concat()` function:


In [183]:
merged_occupation = pd.concat([male_occupation_counts, female_occupation_counts], axis=1, keys=['Male_Counts', 'Female_Counts']).copy()


- This combines the two sets of occupation counts into one dataset
- The `axis=1` argument tells **Pandas** to combine the data horizontally, placing the two sets of counts side by side as columns
- The `keys` argument assigns the column labels `'Male_Counts'` and `'Female_Counts'` to the respective columns, which makes it clear which counts belong to which group
- Let's double-check and view our new dataset by running the name of the subset as shown in the following:

In [184]:
merged_occupation

| occupation | Male_Counts | Female_Counts |
|---|---|---|
| MSP | 392 | 106.0 |
| commander | 312 | 3.0 |
| pilot | 190 | 7.0 |
| flight engineer | 176 | 20.0 |
| PSP | 54 | 5.0 |
| Other (space tourist) | 8 | 2.0 |
| Other (Journalist) | 1 | NaN |
| spaceflight participant | 1 | NaN |
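
The way `pd.concat()` lines up the rows can be sketched on two small made-up sets of counts (all `toy_` names are illustrative). Rows are matched by their label rather than by position, and a label that exists in only one set gets a missing value in the other column.

```python
import pandas as pd

# Made-up counts; "tourist" appears only in the first Series.
toy_male_counts = pd.Series({"pilot": 4, "MSP": 3, "tourist": 1})
toy_female_counts = pd.Series({"MSP": 2, "pilot": 1})

# axis=1 places the Series side by side; keys supplies the column labels.
toy_merged = pd.concat(
    [toy_male_counts, toy_female_counts],
    axis=1,
    keys=["Male_Counts", "Female_Counts"],
)
print(toy_merged)

# Rows are matched by label, so the female "MSP" count lands in the "MSP" row
# even though the two Series list their labels in a different order.
msp_female = toy_merged.loc["MSP", "Female_Counts"]
print(msp_female)

# A label missing from one Series becomes NaN in that column.
missing_female = toy_merged["Female_Counts"].isna().sum()
print(missing_female)
```
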


### 2. Fill Missing Data
- As you can see, some rows have no numerical value in the `Female_Counts` column
- When an occupation appears for only one gender, the other column has no count for it, and pandas marks the gap as missing data (NaN). We will fill these missing values with 0 so that we can compare the counts properly:

In [185]:
merged_occupation = merged_occupation.fillna(0)

- We can double-check and view our new dataset once again by running the name of the subset:

In [186]:
merged_occupation

| occupation | Male_Counts | Female_Counts |
|---|---|---|
| MSP | 392 | 106.0 |
| commander | 312 | 3.0 |
| pilot | 190 | 7.0 |
| flight engineer | 176 | 20.0 |
| PSP | 54 | 5.0 |
| Other (space tourist) | 8 | 2.0 |
| Other (Journalist) | 1 | 0.0 |
| spaceflight participant | 1 | 0.0 |


- In this output we can see that the code filled in any missing values with 0
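
Note that `Female_Counts` still shows values such as `106.0`. A column that contained NaN is stored as decimal (float) numbers, and `fillna(0)` does not convert it back. If whole numbers are preferred, one optional follow-up, which is not part of the original steps, is `astype(int)`. Here is a minimal sketch on made-up data (all `toy_` names are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up table with one missing count, mirroring the situation above.
toy_merged = pd.DataFrame(
    {"Male_Counts": [4, 1], "Female_Counts": [2.0, np.nan]},
    index=["pilot", "tourist"],
)

# Filling the gap with 0 keeps Female_Counts as a float column.
toy_filled = toy_merged.fillna(0)
print(toy_filled.dtypes)

# Optional: convert every column to whole numbers once no NaN remains.
toy_int = toy_filled.astype(int)
print(toy_int)
```
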

### 3. Finalizing the Data
- Right now, our table has no separate `occupation` column. Instead, the occupation names are part of the row labels (also called the index in **pandas**). That makes the data harder to work with later, and the names would be left out of a file exported with `index=False`
- To fix this, we need to move the occupation names out of the index and into their own proper column. We do that with the following line of code:



In [190]:
merged_occupation = merged_occupation.reset_index()


- Now the occupation names will show up as their own column when we export the file, which is much cleaner and easier to use!
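
The effect of `reset_index()` can be checked by comparing the column names before and after. The sketch below uses a made-up table (`toy_merged` and `toy_flat` are illustrative names) whose index is named `occupation`, as in the output shown earlier.

```python
import pandas as pd

# Made-up counts whose index is named "occupation".
toy_merged = pd.DataFrame(
    {"Male_Counts": [4, 3], "Female_Counts": [1, 2]},
    index=pd.Index(["pilot", "MSP"], name="occupation"),
)

# Before: occupation is a row label, not one of the columns.
print(toy_merged.columns.tolist())

# After: occupation becomes a regular column and rows are numbered from 0.
toy_flat = toy_merged.reset_index()
print(toy_flat.columns.tolist())
print(toy_flat.index.tolist())
```

If the index had no name, pandas would call the new column `index` instead, and it could then be renamed with `.rename(columns={...})`.
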

## **Exporting the Subset**
### 1. Turn Into `.csv`
- Now that our data is neatly organized and ready, the final step is to save it in a format that's easy to share or open in programs like Excel. We'll do this by exporting it as a `.csv` file
- Run the following code to save your final dataset:

In [193]:
merged_occupation.to_csv("Sex_Occupation_Counts.csv", index=False)


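- Once it's done running, click the folder icon in the panel on the left-hand side of Google Colab. **Find `Sex_Occupation_Counts.csv` in the file list and download it!** Files saved this way are stored in the temporary Colab session, so download the file before closing the notebook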
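
To double-check an exported file, one option is to read it back in and compare it with the table that was saved. The sketch below does this with a made-up table and a temporary folder; `toy_final`, `toy_counts.csv`, and the other names are illustrative only.

```python
import os
import tempfile
import pandas as pd

# Made-up final table for illustration.
toy_final = pd.DataFrame({
    "occupation": ["pilot", "MSP"],
    "Male_Counts": [4, 3],
    "Female_Counts": [1, 2],
})

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_counts.csv")
    # index=False leaves the 0, 1, ... row labels out of the file.
    toy_final.to_csv(csv_path, index=False)
    toy_reloaded = pd.read_csv(csv_path)

# The reloaded table should have the same columns and values.
print(toy_reloaded.columns.tolist())
print(toy_reloaded)
```
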

## _**You made it to the end! Enjoy your new subset of data!**_