<a href="https://colab.research.google.com/github/JaimeAdele/APEX/blob/main/Module10_your_own_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://images.pexels.com/photos/10325707/pexels-photo-10325707.png?auto=compress&cs=tinysrgb&dpr=2&h=750&w=1260' width=700>  
Photo by Lucas DC from Pexels

# APEX Faculty Training, Module 10: Working with Your Own Data

Created by Valerie Carr and Jaime Zuspann  
Licensed under a Creative Commons license: CC BY-NC-SA  
Last updated: Mar 6, 2022  

**Learning outcomes**  
* To learn how to use Pandas to access your own data files through two methods: via GitHub and via Google Drive

## 1. A couple of notes before you start
* This file is view only, meaning that you can't edit it.
    * To create an editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter`.   
* Keep the following Python style preferences in mind:
    * Variable names should use `snake_case`
    * Include spaces before and after operators, e.g., `x + 1`
    * Don't put unnecessary spaces after a function name, before the parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print (my_variable)`
    * Don't put unnecessary spaces at the beginning or end of parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print( my_variable )`
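
As a quick illustration of the style preferences above, here is a minimal sketch. The variable names and values are made up for the example.

```python
# snake_case variable names (hypothetical values, for illustration only)
participant_count = 12
new_participants = 3

# Spaces before and after operators
total_participants = participant_count + new_participants

# No space between the function name and the parentheses,
# and no padding just inside the parentheses
print(total_participants)
```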

<font color='red'>Exercise 1</font>  
Since we'll still be using the Pandas library for this module, we'll need to import it before we do any exercises. Type in the line of code that imports the pandas library, and use the alias `pd` that we've been using in other modules. This is common practice.

## 2. Accessing Data via GitHub
Now that you have some experience working with our preset datasets, you may be wondering how you could do this with your own data. Thankfully, there are two easy methods for loading in your own data to Colab--through GitHub and Google Drive. We will cover both in this module, with exercises along the way, as usual. Remember, you can choose the method that works best for you and stick with that afterward!

GitHub, if you're unfamiliar with it, is a code hosting platform for version control and collaboration. For our case here, however, we will use it to store datasets in the form of csv files, and learn how to load those datasets into Colab. Let's go through the process step by step.

### 2a. Create a GitHub Account
First, use this link to navigate to GitHub: https://github.com. On the home page, you'll see a field for your email address and a button labeled "Sign up for GitHub". Enter your email address and click the button. You'll be taken to another page, where you can edit your email address if necessary. Click "Continue", then create your password and username and complete the verification for your account.

### 2b. Create a Repository
Great! You now have a GitHub account! Next, we need to create a repository in which you'll store your data. A repository is just a place to store related files, usually code. You can think of each repository as a project. To create a new repository, click the small "+" button at the top right of the page, next to the user icon. You'll see a dropdown menu--click "New Repository". On the next page, you'll need to specify the repository name (e.g., "data-samples") and add a description if you'd like. Don't worry about changing the other options from the default. When you're finished, click the "Create Repository" button. And that's it--now you have a repository to store your own data!

### 2c. Add CSV Files
Now that we have a repository, we're going to start filling it with csv files. After you created the repository, you were taken to a page showing a number of command-line instructions. Don't worry about these; instead, in the Quick Setup section toward the top of the page, click the link that says "uploading an existing file". You're now on a page where you can either drag and drop files or choose files from your computer. Go ahead and upload a csv file of your choosing using either method.

Just below, you'll see a section that says "Commit changes". When you save a file (or save a change to a file) in your repository, this is called a "commit" -- you're committing changes to the repository. Each commit should have a brief descriptive message indicating what changes that commit makes. Something simple like "Adding \<your_filename.csv\>" is sufficient. The idea is to be able to look at the message and have an overview of what that commit does. Below the commit message is an area for an optional more in-depth description for that commit. This isn't necessary, but may be useful to describe the contents of the added file, or exactly what changes were made.

Once you've added the commit message and optional description, click the "Commit changes" button at the bottom. And that's it! Now, your csv file is hosted in GitHub and can be used here in Colab.

### 2d. Obtain URL for CSV File
In order to use your newly uploaded csv file in Colab, you'll need to get the url. In your GitHub repository, click the name of your file. This will bring you to a page with a preview of the file in a spreadsheet-like format. At the top right corner of the preview, there is a button that says "Raw". Clicking this button opens the file in your browser and displays the contents as they actually are -- comma separated values. The url in the address bar is the url you'll use to read in the file here in Colab. Remember that the url is used as a string when reading in a file with Python, so be sure to add quotation marks around it.
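
For orientation, the Raw URL typically lives on the `raw.githubusercontent.com` domain rather than `github.com`. The sketch below shows the usual relationship between the two address forms. The helper `to_raw_url` and the account name `example-user` are hypothetical, and the sketch assumes a public repository whose branch name contains no slashes. Copying the address from the Raw page remains the most reliable approach.

```python
from urllib.parse import urlsplit


def to_raw_url(preview_url):
    """Convert a GitHub file-preview URL into the matching Raw URL.

    Assumes the common public-repository layout:
    https://github.com/<user>/<repo>/blob/<branch>/<path>
    """
    parts = urlsplit(preview_url)
    segments = parts.path.strip('/').split('/')
    if parts.netloc != 'github.com' or len(segments) < 5 or segments[2] != 'blob':
        raise ValueError('Not a GitHub file-preview URL: ' + preview_url)
    user, repo, _, branch = segments[:4]
    file_path = '/'.join(segments[4:])
    return f'https://raw.githubusercontent.com/{user}/{repo}/{branch}/{file_path}'


# Hypothetical account and repository names, for illustration only
preview_url = 'https://github.com/example-user/data-samples/blob/main/my_data.csv'
filename = to_raw_url(preview_url)
print(filename)
```

If the address you copied starts with `https://github.com/` and contains `/blob/`, it is most likely the preview page rather than the Raw page.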

<font color='red'>Exercise 2</font>  
Now that you've uploaded your csv file to GitHub and retrieved the url, read it in below by replacing the `YOUR_URL_HERE` text with the url you just retrieved from GitHub. The code to read in the file is already in the cell below; it relies on the `pd` alias you imported in Exercise 1.

filename = 'YOUR_URL_HERE'
my_df = pd.read_csv(filename)

And that's it! Now you can use your own dataset with all of the operations we've covered in these modules!  
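
To get a feel for what `pd.read_csv` returns without depending on any particular URL, here is a minimal sketch that reads csv text held in memory. The data and the names `csv_text` and `sample_df` are made up for illustration; with a real file you would pass your url string instead.

```python
import io

import pandas as pd

# Made-up csv text standing in for the contents of a Raw GitHub page
csv_text = """name,score
Ada,91
Grace,88
Alan,79
"""

# read_csv accepts a file-like object as well as a URL or file path
sample_df = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns)
print(sample_df.shape)
# Column names come from the first line of the csv
print(list(sample_df.columns))
```

Checking the shape and column names like this is one quick way to see whether a file was parsed the way you expected.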

<font color='red'>Exercise 3</font>  
Confirm that the new dataframe contains the data from your csv file by using the `head()` method to display as many rows as you'd like.

## 3. Accessing Data via Google Drive
Another method for reading in your own data to Colab is to use Google Drive. Since many people already use Google Drive for other things, it may be more convenient to keep data there.

### 3a. Log in to Google Drive
In this module, we assume that you have a Google account, and therefore do not need to create an account in order to access Google Drive. Use this url to navigate to Google Drive: https://drive.google.com/. If you are not already logged into your Google account, click the "Go to Drive" button at the top right corner, and follow the steps to log in.

### 3b. Adding Files to Google Drive
Once you're in Google Drive, navigate to where you would like to store your csv file, creating new folders if necessary. Once there, you can upload files to Drive by simply dragging and dropping the file onto the page. Alternatively, you can click the "+ New" button at the top left of the page and choose "File Upload", then choose the desired file from your computer. 

When you've uploaded the file, make sure to note the precise location, as you'll need it later. For example, My Drive > APEX > Files > my_data.csv.

### 3c. Connecting Colab with Drive
Now you need to give Colab permission to connect to your Drive account. The cell below contains the necessary code to do so.  

<font color='red'>Exercise 4</font>  
Run the cell below, then follow the prompts to confirm the connection, give permissions, and choose which account to connect to. It may take a second to connect when you're finished, but once it's connected you'll see `Mounted at /content/drive` as the output.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
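
If you'd like to double-check the connection from code, one option is to test whether the mount folder exists. This is only a rough sketch: the helper name `drive_is_visible` is hypothetical, and on a machine where that folder does not exist it will simply report `False`.

```python
import os

# The folder passed to drive.mount() in the cell above
MOUNT_POINT = '/content/drive'


def drive_is_visible(mount_point=MOUNT_POINT):
    """Return True if the mount folder exists on this machine."""
    return os.path.isdir(mount_point)


print(drive_is_visible())
```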


### 3d. Reading a File from Drive
Next you'll need to tell Colab exactly where to find your file within Drive. To do so, create a variable called `filepath` or something similar, just as before, and assign to it the path to your file, in the form of a string (again, remember the quotes). In a file path, a forward slash separates each folder from the next (and one goes before the `content` folder), so this line of code should look something like this:  
`filepath = '/content/drive/My Drive/APEX/Files/my_data.csv'`  

<font color='blue'>Note:</font>  
When specifying the filepath, you can leave the first part, `/content/drive/My Drive/`, as is. You'll only need to change the folder names and file name that appear after it.
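
If you prefer not to type the slashes by hand, the path can also be assembled from its pieces. The sketch below is one possible approach, assuming the same example location as above (My Drive > APEX > Files > my_data.csv); the helper `drive_path` and the constant `DRIVE_ROOT` are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath

# The part of the path that stays the same once Drive is mounted
DRIVE_ROOT = PurePosixPath('/content/drive/My Drive')


def drive_path(*parts):
    """Join folder names and a file name onto the Drive root."""
    return str(DRIVE_ROOT.joinpath(*parts))


# My Drive > APEX > Files > my_data.csv
filepath = drive_path('APEX', 'Files', 'my_data.csv')
print(filepath)
```

Typing the full string directly, as shown earlier, works just as well; this version only saves you from miscounting slashes.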

<font color='red'>Exercise 5</font>  
In the cell below, assign your filepath to a variable, like above. Make sure to run the cell afterward.

<font color='red'>Exercise 6</font>  
Awesome! Now that you have a variable that contains the path to your file, read the csv into a dataframe like before, and check the first few observations with the `head()` method to make sure that everything worked properly.

## All done!
Now you're ready to use your own data files using either of the methods covered here. It would be great to practice some of the operations you've learned on your own data--see what you can do!