<a href="https://colab.research.google.com/github/scskalicky/VocabAtVic2023NLPWorkshop/blob/main/05-loading-data-into-colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supplemental: Using your own data
All of the examples in this workshop use provided data. 

But you probably want to use your own data. 

This notebook shows you a few options for doing so.

# Option 1: Mounting your Google Drive

One way to do so involves connecting Google Colab with your Google Drive.

The process of connecting Colab to your Google Drive is known as mounting your drive. To do so, you click on the folder icon on the left side of the Colab page:

<img src = https://i.imgur.com/82Wedue.png>

Then you click the "mount drive" icon in the next menu:

<img src = https://i.imgur.com/d8DxFIu.png>

Colab should then automatically add a code cell like this:

<img src = https://i.imgur.com/ttfUkwi.png>

Run the cell to mount your Google Drive. You will most likely see several permission prompts asking whether it's okay to connect Colab to the associated Google account. It's fine to do this with notebooks you make or the ones I give you, but be wary of other notebooks that ask for your account permissions. The risk is likely small, but I feel obligated to tell you not to blindly trust any other Colab notebooks you come across.
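For reference, the cell Colab generates is usually equivalent to the sketch below. The `try`/`except` guard is my addition (it is not part of the generated cell) so that the snippet does not fail when run outside Colab, and `/content/drive` is the mount point assumed throughout this notebook.

```python
# google.colab only exists inside a Colab runtime, so guard the import
try:
    from google.colab import drive
except ImportError:
    drive = None  # not running in Colab

if drive is not None:
    # triggers the permission prompts, then exposes your Drive under /content/drive
    drive.mount('/content/drive')
else:
    print('Not running in Colab, so there is nothing to mount.')
```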



## Accessing files in your Google Drive

Now that your Drive is connected, you can directly access files in your Google Drive account. This is very handy. You might need to click the refresh button (the folder with the circular arrow) to see the new folder.

You should see a new folder on the left side menu (after clicking on the folder icon) called `drive`. Clicking that folder should then reveal a subfolder called `MyDrive`. The `MyDrive` folder is the root folder for your Google Drive.

<img src = https://i.imgur.com/Av1mGtQ.png>




In order to access files on your drive, you will need to be able to give Python the full filepath to your files. No matter where your files are, the start of your filepath will always be `/content/drive/MyDrive/...`, where the `...` are any additional folders.

So, for example, if you had a file called `mydata.txt` located in the base level of your Google Drive, the filepath location would be `/content/drive/MyDrive/mydata.txt`. If you had that same file located in a folder called `mydata`, the filepath would be `/content/drive/MyDrive/mydata/mydata.txt`, and so on.
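The two example paths above can also be built in code rather than typed by hand. This is a minimal sketch: `drive_path` is a hypothetical helper name of my own, and the file and folder names are just the examples from the paragraph above.

```python
from pathlib import PurePosixPath

# root of a mounted Google Drive inside Colab
DRIVE_ROOT = PurePosixPath('/content/drive/MyDrive')

def drive_path(*parts):
    """Join folder and file names onto the Drive root (hypothetical helper)."""
    return str(DRIVE_ROOT.joinpath(*parts))

# a file at the base level of the Drive
top_level = drive_path('mydata.txt')

# the same file inside a folder called mydata
in_folder = drive_path('mydata', 'mydata.txt')

print(top_level)
print(in_folder)
```

Once the Drive is mounted, either string can be passed to `open()` like any other filepath.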

# Option 2: Using `!wget` or the `requests` library

Using Google Drive is a solid bet for integrating with Colab, but you might not like mounting drives each time you run a notebook, or working with files in your drive.

There are other options which involve reading files directly from the internet, using shell commands such as `!wget` or Python libraries for requesting data from URLs, such as the `requests` library. There are various places in these materials which show how to use either method.

However, using these methods requires that the data already exists on the internet somewhere, at a URL you can access (and ideally control). Therefore, I only recommend this method if you are able to control the place where your data lives, and it might just be easier for you to use Google Drive if you don't want to go that route. That said, GitHub is a solid choice, and one which is used in some of these notebooks (as well as the example below).

The main benefit of using `!wget` is that the data is loaded directly into the notebook environment, so you do not need to muck around with sifting through files on your Google Drive. This method also makes sharing a bit easier, since someone else does not need to have the same data on their Drive.

Below is an example of using `!wget` to access a text file saved on GitHub.

In [None]:
# using !wget to load a file into the notebook environment
!wget 'https://raw.githubusercontent.com/scskalicky/VocabAtVic2023NLPWorkshop/main/tmoom.txt'

`!wget` saves the file in the notebook's working directory, which is `/content` by default. So instead of pointing at `/content/drive/MyDrive/...`, you just point at `/content/...`

You need to use the appropriate method to open the file, such as using `open()` to open a text file:

In [None]:
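# import NLTK and download the tokenizer models that word_tokenize() needs
import nltk
nltk.download('punkt')  # newer NLTK versions may ask for 'punkt_tab' instead

# read in the text
with open('tmoom.txt', encoding='utf-8') as f:
    tmoom = f.read()

# split into tokens
tmoom_tokens = nltk.word_tokenize(tmoom)

# look at the first ten tokens!
tmoom_tokens[:10]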

## `requests`

You can also read in data directly from a URL using Python libraries such as `requests` or `urllib`. This method still requires that you know a URL where the text file can be accessed, but unlike `!wget` it loads the file directly into Python, rather than saving it to the notebook environment first. You typically need to point a function towards a URL and then use some additional methods to get at the data. This works best for raw `.txt` files.

In [None]:
# import the library
import requests

In [None]:
# save URL to a variable
URL = 'https://raw.githubusercontent.com/scskalicky/VocabAtVic2023NLPWorkshop/main/tmoom.txt'

In [None]:
# use .get() to retrieve file at the URL
data = requests.get(URL)

The information is saved in the variable as a `Response` object, a format specific to the `requests` library. Displaying the variable on its own shows only the response's status code, not the text it contains.

In [None]:
data

This variable has a variety of attributes, one of which is the `text` attribute, which holds the text returned from the URL. In this case, that is the contents of a `.txt` file. You can access the text using `.text`. Note that because it is an attribute rather than a method, you do not need parentheses.

In [None]:
data.text

We can of course chain these steps together in order to read in the text and convert it into tokens or some other format in one single line. In the cell below, I split the text of the response on newlines inside of a list comprehension. The result is a single line which requests the file and returns a list of all the lines in the file.

In [None]:
[line for line in requests.get(URL).text.split('\n')]
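The text above mentions `urllib` as an alternative to `requests`. Below is a minimal sketch of the same idea using only the standard library. `fetch_lines` is a hypothetical helper name of my own, and the choices to assume UTF-8 and to drop blank lines are assumptions rather than requirements.

```python
from urllib.request import urlopen

def fetch_lines(url, encoding='utf-8'):
    """Download a text file and return its non-empty lines (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        # the response body arrives as bytes, so decode it into a string
        text = response.read().decode(encoding)
    # splitlines() handles both \n and \r\n line endings;
    # lines containing only whitespace are skipped
    return [line.strip() for line in text.splitlines() if line.strip()]
```

With the `URL` variable defined earlier, `fetch_lines(URL)[:5]` would show the first five non-empty lines of the file.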

# Option 3: Uploading files manually

There is also a way to upload files directly to the notebook environment. This involves using the `files` module from the `google.colab` library. First, import the module.

In [None]:
# import the files module from colab
from google.colab import files

After importing the module, you have access to a few functions, one of which allows you to upload files into your notebook. You do this with the `files.upload()` function.

In [None]:
# run a cell with this command to prompt the user to upload files.
files.upload()

You can then choose a file from anywhere on your computer and upload it to the notebook environment. The file can then be accessed using the same methods used with `!wget`.
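`files.upload()` also returns a dictionary keyed by filename, with each file's contents as bytes, so you can work with an upload without reading it back from disk. The sketch below uses a stand-in dictionary so it runs outside Colab. `decode_uploads` is a hypothetical helper name of my own, and it assumes the uploaded files are UTF-8 text.

```python
def decode_uploads(uploaded, encoding='utf-8'):
    """Convert {filename: bytes} into {filename: str} (hypothetical helper)."""
    return {name: content.decode(encoding) for name, content in uploaded.items()}

# Stand-in for the value returned by files.upload(); in Colab you would write:
# uploaded = files.upload()
uploaded = {'example.txt': b'some uploaded text'}

texts = decode_uploads(uploaded)
print(texts['example.txt'])
```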