<a href="https://colab.research.google.com/github/valeriecarr/APEX/blob/main/Module10_your_own_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://images.pexels.com/photos/10325707/pexels-photo-10325707.png?auto=compress&cs=tinysrgb&dpr=2&h=750&w=1260' width=700>  
Photo by Lucas DC from Pexels

# APEX Faculty Training, Module 10: Working with Your Own Data

Created by Valerie Carr and Jaime Zuspann  
Licensed under a Creative Commons license: CC BY-NC-SA  
Last updated: January 3, 2024  

**Learning outcomes**  
* To learn how to use Pandas to access your own data files through two methods: via GitHub and via Google Drive

## 0. A couple notes before you start
* This file is view only, meaning that you can't edit it.
    * To create an editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter`.   
* Keep the following Python style preferences in mind:
    * Variable names should use `snake_case`
    * Include spaces before and after operators, e.g., `x + 1`
    * Don't put unnecessary spaces after a function name, before the parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print (my_variable)`
    * Don't put unnecessary spaces at the beginning or end of parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print( my_variable )`

## 1. Importing Pandas

<font color='red'>Exercise 1</font>  
Since we'll still be using the Pandas library for this module, we'll need to import it using the usual `pd` abbreviation. Include this line of code below.

## 2. Working with Your Own Data
Now that you have experience working with our premade datasets, you may be wondering how you can instead work with your own data. Thankfully, there are two easy methods for using your own files in Colab: via GitHub or via Google Drive.

If you're unfamiliar with GitHub, it's a platform for hosting code that allows for version control and collaboration. It can also be used for a much simpler purpose, storing datasets in the form of CSV files, which is exactly what we'll be doing.

We will cover both the GitHub and Google Drive approaches in this module, with exercises along the way, as usual. Neither is preferable over the other; rather, we recommend that you choose the method that works best for you.

## 3. The GitHub Approach

### 3a. Create a GitHub Account

If you already have a GitHub account, great! Simply log in and proceed to `3b`.

If you don't yet have a GitHub account, the first step will be to navigate to https://github.com and create an account. You should see a field for you to input your email address and a button to "Sign up for GitHub". You'll be taken to another page, where you can edit your email address if necessary, then click "Continue" and create your password and username, and complete verification for your account.

### 3b. Create a Repository
Next, you'll need to create a repository (or "repo") in which you'll store your data. A repository is simply a place to store related files, for example, code and data files all related to the same project.

To create a new repo:
* Click the small `+` button at the top right of the page, next to the user icon. From the drop-down menu that appears, select `New Repository`.
* On the next page, you'll need to specify the repo name (e.g., "data_samples") and add a description (e.g., "data sets to be used in Intro Stats"). Don't worry about changing the other options from the default.
* When you're finished, click the `Create Repository` button. You now have a repo to store your own data!
* Continue reading below for next steps.

### 3c. Add CSV Files

A quick bit of background: Most of us are familiar with Excel (.xlsx) files and Google Sheets as ways to save spreadsheet data. A comma-separated values (.csv) file is another means of storing spreadsheet data, and it's the recommended file type when working with GitHub. If you have an existing spreadsheet, you can export it as a CSV: in Excel, use `Save As` and choose CSV as the file type; in Sheets, use `File` > `Download` > `Comma Separated Values (.csv)`.
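
To make the format concrete, here is a minimal sketch of what a CSV file actually contains. The column names and values are made up for illustration, and the code uses Python's built-in `csv` module to write a tiny table as plain text rather than as a file on disk.

```python
import csv
import io

# Hypothetical spreadsheet contents: a header row followed by data rows
rows = [
    ["student", "grade"],
    ["A", 88],
    ["B", 92],
]

# Write the rows as CSV text to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)

# Each row becomes one line of text, with commas between the values
csv_text = buffer.getvalue()
print(csv_text)
```

Each spreadsheet row becomes one line of text, and each cell is separated from the next by a comma. That simplicity is why CSV files work well with GitHub and Pandas.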

Back to adding files to your newly created repo...

You should see a new page with a blue rectangle at the top that says "Quick setup". At the bottom of this section, you should see: "Get started by creating a new file or uploading an existing file".
* Click the link that says "uploading an existing file", which will take you to a new page that allows you to either drag and drop files, or choose files from your computer.
* Go ahead and upload a CSV file of your choosing using either of these methods.

Underneath the section for adding new files, you'll see another section that says `Commit changes`. When you add a file (or make a change to a file) in your repo, this is called a "commit." In other words, you're committing changes to the repo. Each commit should have a brief descriptive message indicating what changes the commit entailed.

Something simple like "Added my_file.csv" is sufficient. The idea is to be able to look at the message and have an overview of what the commit involved. Below the commit message is an area for an optional and more in-depth description for that commit. This isn't required, but you may find it useful to include a blurb about the data file's contents.

Once you've added the commit message and optional description, click the "Commit changes" button at the bottom. After a few seconds of processing, your CSV file is now hosted in GitHub and can be used in Colab notebooks. Continue reading for one final step.

### 3d. Obtain URL for CSV File
To use a GitHub file (i.e., a file stored in your GitHub repo) for analysis in Colab, you'll need to obtain the URL for that file.

* In your GitHub repo, click the name of the desired CSV file. (Be sure to click the actual file name and not the commit message to the right of it.)
* This will bring you to a page with a preview of the file in a spreadsheet-like format.
* At the top right corner of the preview, click the button that says `Raw`.
* This will display the file's contents a bit differently – as comma separated values (which is what you want!). The URL in the address bar is the URL you'll need for Colab.
* Copy this URL, and continue to Exercise 2 for an explanation of where to put the URL within Colab.
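
If it helps to see how the two URLs relate, the sketch below converts a file-page URL into its raw form. The function name `to_raw_url` and the user, repo, and file names are hypothetical. It assumes a public repo and a URL that follows the usual `https://github.com/<user>/<repo>/blob/<branch>/<file>` pattern. The URL shown after you click `Raw` may differ slightly in form, so copying it from the address bar remains the most reliable route.

```python
def to_raw_url(page_url):
    """Convert a GitHub file-page URL into its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not page_url.startswith(prefix) or "/blob/" not in page_url:
        raise ValueError(
            "Expected a URL like https://github.com/<user>/<repo>/blob/<branch>/<file>"
        )
    # Swap the host and drop the first '/blob/' segment
    rest = page_url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + rest


# Hypothetical user, repo, and file names, for illustration only
page_url = "https://github.com/my_username/data_samples/blob/main/my_file.csv"
raw_url = to_raw_url(page_url)
print(raw_url)
```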

<font color='red'>Exercise 2</font>  
In your Colab notebook, you'll need to create a variable to represent the file. Below, we simply use `filename`. Then, you'll paste your URL as a string (i.e., with quote marks around it).
* Replace the sample URL below with your actual URL.
* Next, insert a line of code beneath it that will read the file in and create a dataframe with a name of your choosing. Look back at Module 8, Exercise 4 if you need a reminder.
* Finally, insert code that will allow you to view the header (i.e., the first few lines of the dataframe).

```python
# replace the URL with your own
filename = 'https://raw.githubusercontent.com/bla/bla/bla'

# read in file as dataframe

# view header
```

### 3e. Adding Files in the Future
You can add more files in the future as follows:
* Log in to your GitHub account and select the relevant repo from the left side of the screen.
* You should now see a list of existing files. Above this list and to the right, click on the `Add file` button and then select `Upload files`.
* The resulting screen should look familiar to you, such that you can drag and drop or select files from your computer. You'll need to include a commit message, commit the file, and then obtain the URL as described above.

And that's the GitHub approach! If you would prefer to store your files on Google Drive, instead, continue reading below.

## 4. The Google Drive Approach
Another approach for working with your own data in Colab is to store your files on Google Drive. There are actually two different methods for using files on Drive:
1. Provide Colab with a link to a Google Sheet
2. Connect Google Drive and Google Colab so that Colab has access to all CSV files stored in Drive

Below we'll provide instructions for each method.

**Note:** These instructions assume that you already have a Google account, e.g., an account associated with your institution or a personal account.

### 4a. Linking to a Google Sheet

<font color='red'>Exercise 10</font>  
The following steps will teach you how to use your own data stored as a Google Sheet:
* Go to [Google Drive](https://drive.google.com/)
* Create a new Google Sheet: In the upper left of Drive, click the `+New` button and select Google Sheets.
    * A new tab will open with a blank Google Sheet. In the upper left, click `Untitled Spreadsheet` and give it a name of your choosing.
* Enter in some simple data for the purposes of this exercise. For example, you could create a single column named `grades` and generate a handful of fake scores.
* Once you're ready to use your Google Sheet, in the upper right of the sheet, click on the `Share` button.
    * Change access to be `Anyone with the link` and then click `Copy link`.
* In the cell below, replace `link` with your sharing link (note: make sure to leave the quote marks in place!)
* Next, you'll need to replace the end of the link (`edit?usp=sharing`) with `export?format=csv`
* Go ahead and run the cell. If everything worked properly, you should now see the first few rows of your dataframe!

```python
# Copy/paste your shareable link, and then modify it
filepath = 'link'
new_df = pd.read_csv(filepath)
new_df.head()
```
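
If you'd rather not edit the link by hand, the sketch below does the same replacement in code. It assumes that your sharing link contains `/spreadsheets/d/` followed by the sheet's ID. The function name `sheet_link_to_csv_url` and the ID `FAKE_SHEET_ID` are placeholders rather than anything from a real sheet.

```python
import re


def sheet_link_to_csv_url(share_link):
    """Turn a Google Sheets sharing link into a CSV export URL."""
    # The spreadsheet ID is the path segment that follows '/spreadsheets/d/'
    match = re.search(r"/spreadsheets/d/([^/]+)", share_link)
    if match is None:
        raise ValueError("This doesn't look like a Google Sheets link")
    sheet_id = match.group(1)
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"


# 'FAKE_SHEET_ID' is a placeholder, not a real spreadsheet
share_link = "https://docs.google.com/spreadsheets/d/FAKE_SHEET_ID/edit?usp=sharing"
csv_url = sheet_link_to_csv_url(share_link)
print(csv_url)
```

You could then pass `csv_url` to `pd.read_csv` in place of the hand-edited link. Because the function looks only at the ID, it should also cope with links that end in something other than `edit?usp=sharing`.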

### 4b. Connecting Drive and Colab

<font color='red'>Exercise 11</font>  

*Log in to Drive*

Go to https://drive.google.com/.
* If you're already logged in to the proper account, you should see your Drive folders and files as usual.
* If you're not already logged in, click the "Go to Drive" button at the top right corner, and follow the steps to log in. You should now see your folders and files.

*Add desired file(s) to Drive*

In Drive, navigate to where you would like to store your CSV file(s), creating one or more new folders if necessary. Once you've navigated to the desired folder, you can upload files to Drive by simply dragging and dropping the file onto the page. Alternatively, you can click the "+ New" button at the top left of the page and choose "File Upload", then choose the desired file from your computer.

When you've uploaded the file(s), make sure to note the precise location or "file path", given that you'll need this information later. For example:

`/My Drive/APEX/Data/my_data.csv`

In the above example, slashes denote folders. In other words, within My Drive, imagine you have a folder called APEX, and within that you have another folder called Data, and within that you have your file.
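
To see how such a path breaks down, here is a short sketch using Python's built-in `pathlib` module. It works purely on the text of the example path above, so it runs even if no such file exists. The variable name `drive_path` is just for illustration.

```python
from pathlib import PurePosixPath

# The example path from above
drive_path = PurePosixPath("/My Drive/APEX/Data/my_data.csv")

print(drive_path.parts)   # every piece of the path, starting with the root '/'
print(drive_path.parent)  # the folder that contains the file
print(drive_path.name)    # the file name itself
```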

*Connect Drive and Colab*

The next step needs to be completed in Colab rather than in Drive. Specifically, you need to give Colab permission to connect to your Drive account to access the desired file.

In brief, the code below imports a library relevant for working with files in Drive, and then connects to your desired Google Drive account.

Run the cell below, which will bring up several windows and prompts as follows:
* Permission to connect the current notebook to Google Drive; click `Connect to Google Drive`
* Selection of the preferred Google account; simply click the relevant account
* Notification that connecting with Google Drive will allow you to see, edit, create, and delete files in Drive, among other things; click `Allow`

It may take several seconds to connect once you've finished these steps, but eventually you should see an output beneath the cell that says: `Mounted at /content/drive`.

```python
from google.colab import drive
drive.mount('/content/drive')
```

<font color='red'>Exercise 12</font>  

You're now ready to read in your desired CSV file and create a dataframe. Unlike the GitHub approach, however, we won't be using a URL. Instead, we'll point Colab to the file's location ("file path") in Drive:

* Create a variable called `filepath` or something similar, and assign it the path to your file as a string. As a reminder, you should put slashes between each folder.
* **Note:** When specifying the file path, you will always need to include `/content/drive/` at the very beginning of the path, before you start specifying the exact location.
* Putting it all together, you might have:

`filepath = '/content/drive/My Drive/APEX/Data/my_data.csv'`

Naturally, the part of the path after `My Drive` will be different for each user, and different for each specific file.

In the cell below, create the variable `filepath` and then assign it a string with the path to your desired file, using the above code as an example. Make sure to run the cell, although note that you won't see any output.

<font color='red'>Exercise 13</font>  
Awesome! Now that you have a variable that contains the path to your file, you can simply read in the CSV file using the Pandas process with which you're familiar. Like before, go ahead and check the header to make sure that everything worked properly.

If you get an error along the lines of "File does not exist"...
* This is an indication that there's a problem with your file path.
* Go back to Exercise 12 and double check that you've properly specified the path, keeping in mind that Python is picky about spelling, capitalization, spacing, and so forth. Also double check that your path includes the `/content/drive/` bit at the beginning (with a slash before `content`).
* Once you've fixed your path, try running the cells in Exercises 12 and 13 again.
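
If you can't spot the problem by eye, a small helper like the one below may narrow it down. This is a sketch rather than part of Pandas or Colab, and the function name `first_missing_part` is made up. It walks through the path one folder at a time and reports the first piece that doesn't exist, which is often where a typo or capitalization slip is hiding.

```python
from pathlib import Path


def first_missing_part(filepath):
    """Return the shortest leading part of filepath that doesn't exist, or None if all of it exists."""
    path = Path(filepath)
    # path.parents runs from the immediate parent up to the root, so reverse it
    candidates = list(path.parents)[::-1] + [path]
    for candidate in candidates:
        if not candidate.exists():
            return str(candidate)
    return None


# Example path from Exercise 12; replace it with your own
filepath = '/content/drive/My Drive/APEX/Data/my_data.csv'
print(first_missing_part(filepath))
```

If the output is `None`, the whole path exists and the problem lies elsewhere. Otherwise, compare the reported piece against what you see in Drive. Note that outside of Colab, or before Drive is mounted, the first missing piece will usually be near the start of the path.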

## All done!
Now you're ready to use your own data files using either of the methods covered here. It would be great to practice some of the Python concepts you've learned thus far on your own data. Feel free to review prior modules, but this time using your own data instead!

Next up, we'll cover the basics of conducting statistical analyses in Python.