<a href="https://colab.research.google.com/github/scskalicky/SNAP-CL/blob/main/05_Loading_Texts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reading in a longer text**
It's now time to start thinking about how to get your own data into Python/NLTK so you can analyse it. 


Navigate to [this page](https://raw.githubusercontent.com/scskalicky/SNAP-CL/main/tmoom.txt). You should see some text. Copy and paste the text into a text editor and save it to your desktop (or somewhere else on your computer). Name the file `tmoom.txt`. Make sure the file is a `.txt` file. 

Now that we have a text file, the next step is to put the file somewhere that the Colab notebook can reach. To do so, click on the file folder icon on the left-hand side of the window. The menu should expand and you will see a folder called "sample_data", which was loaded with this notebook. 

You will also see four icons above it which, from left to right, let you upload files into the current session, refresh the list of files, mount your Google Drive, and show hidden files.

The first option is the simplest: it lets you upload the `tmoom.txt` file directly into the notebook's session storage. You will get a warning telling you that the file will be removed whenever the notebook session closes. 

The third option (the grey file folder with the Google Drive icon) allows you to "mount" your Google Drive. This gives the notebook temporary access to your entire Google Drive, so you can read and write files on your Drive. Those files will not be removed once the notebook session ends. 

Whichever option you choose, we can then use the `open()` function to load in the file. You will need to provide the path to the file inside the brackets, and this needs to be typed as a string. If you choose the first option and upload the text into the notebook, you should be able to run the cell below and get the file into the workspace.
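
Before calling `open()`, it can help to confirm that Python can actually see the file. This is a minimal sketch, assuming the file was uploaded to the notebook's current working directory under the name `tmoom.txt`; the variable names are illustrative only.

```python
import os

# the name we expect the uploaded file to have (from the steps above)
file_name = 'tmoom.txt'

# list everything in the current working directory
files_here = os.listdir('.')
print(files_here)

# check whether the file is where we expect it to be
file_found = os.path.exists(file_name)
print(f'{file_name} found: {file_found}')
```

If the last line reports `False`, the upload probably did not complete, or the file was saved under a different name.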

Because we want to store the *contents* of the .txt file to a variable, we will append `.read()` to the end of `open()`.


In [6]:
# Load in the contents of the .txt file and save it as a variable
tmoom = open('tmoom.txt').read()
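
`open(...).read()` works, but it leaves closing the file to Python. A common alternative is a `with` block, which closes the file as soon as the read is finished. The sketch below writes a tiny stand-in file first so that it runs on its own; the file name `tmoom_sample.txt` and the `utf-8` encoding are assumptions for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path

# create a small stand-in file so this example runs anywhere (hypothetical name)
sample_path = Path(tempfile.gettempdir()) / 'tmoom_sample.txt'
sample_path.write_text('"They\'re made out of meat."\n"Meat?"\n', encoding='utf-8')

# open the file, read its contents into a string, and close it automatically
with open(sample_path, encoding='utf-8') as f:
  sample_text = f.read()

# the contents are stored as a single string, just like tmoom
print(type(sample_text))
print(sample_text)
```

To use this pattern on the real file, replace `sample_path` with `'tmoom.txt'`.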

In [7]:
# inspect the variable - it is saved as a single string
tmoom

'"They\'re made out of meat."\n"Meat?"\n"Meat. They\'re made out of meat."\n"Meat?"\n"There\'s no doubt about it. We picked several from different parts of the planet, took them aboard our recon vessels, probed them all the way through. They\'re completely meat."\n"That\'s impossible. What about the radio signals? The messages to the stars."\n"They use the radio waves to talk, but the signals don\'t come from them. The signals come from machines."\n"So who made the machines? That\'s who we want to contact."\n"They made the machines. That\'s what I\'m trying to tell you. Meat made the machines."\n"That\'s ridiculous. How can meat make a machine? You\'re asking me to believe in sentient meat."\n"I\'m not asking you, I\'m telling you. These creatures are the only sentient race in the sector and they\'re made out of meat."\n"Maybe they\'re like the Orfolei. You know, a carbon-based intelligence that goes through a meat stage."\n"Nope. They\'re born meat and they die meat. We studied them f

We can now operate on this text as we have done with our previous examples. Since this is a new notebook, we'll need to reload NLTK and its resources first. 

In [8]:
# load NLTK library
import nltk
# download resources necessary for tokenizing and part of speech tagging.
nltk.download(['punkt', 'averaged_perceptron_tagger', 'tagsets'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

We can load in the same preprocess, lexical diversity, and pipeline functions as before:

In [9]:
# define a string containing punctuation markers we do not want
punctuation = '!.,\'";:-'

# define a function to pre-process text
def preprocess(text):
  # lower case the text and save results to a variable
  lower_case = text.lower()
  # remove punctuation from lower_case and save to a variable
  # don't worry too much if you don't understand the code in this line. 
  lower_case_no_punctuation = ''.join([character for character in lower_case if character not in punctuation])
  # return the new text to the user
  return lower_case_no_punctuation



# define a function to calculate lexical diversity
def lexical_diversity(tokens):
  # return the number of unique tokens divided by the total number of tokens
  return len(set(tokens))/len(tokens)



def pipeline(string_input):
  # first lowercase the string and clear punctuation using our preprocess function (defined above)
  preprocess_string = preprocess(string_input)

  # now use NLTK to tokenize the preprocessed text
  tokenized_string = nltk.word_tokenize(preprocess_string)

  # calculate lexical diversity using our lexical_diversity function (defined above)
  ld = lexical_diversity(tokenized_string)

  # pos tag the tokens
  pos_tagged_string = nltk.pos_tag(tokenized_string)

  # calculate the frequency of each (word, tag) pair
  fdist = nltk.FreqDist(pos_tagged_string)

  # output some information about the text
  print(f"""
  Length:\t{len(tokenized_string)}\n
  Lexical Diversity:\t{ld}\n
  Top 5 Frequent Words:\t{fdist.most_common(5)}
  """)

Voila, we can now operate on our text as we have been doing. 

In [13]:
pipeline(tmoom)


  Length:	808

  Lexical Diversity:	0.37995049504950495

  Top 5 Frequent Words:	[(('the', 'DT'), 42), (('meat', 'NN'), 40), (('?', '.'), 29), (('to', 'TO'), 21), (('they', 'PRP'), 18)]
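
Two details in this output are worth unpacking. Lexical diversity is the number of unique tokens divided by the total number of tokens, and `?` shows up among the most frequent items because the `punctuation` string defined earlier does not include it. Here is a small sketch on toy data; the token list, the `strip_characters` helper, and the `punctuation_with_question` name are made up for illustration.

```python
# toy tokens: 8 in total, 5 of them unique
toy_tokens = ['they', 'made', 'the', 'machines', 'meat', 'made', 'the', 'machines']
toy_ld = len(set(toy_tokens)) / len(toy_tokens)
print(toy_ld)  # 5 / 8 = 0.625

# the punctuation string from the notebook has no question mark
punctuation = '!.,\'";:-'
# hypothetical variant that also strips question marks
punctuation_with_question = punctuation + '?'

def strip_characters(text, unwanted):
  # lower case the text and drop every character listed in unwanted
  return ''.join([character for character in text.lower() if character not in unwanted])

print(strip_characters('"Meat?"', punctuation))                # meat?
print(strip_characters('"Meat?"', punctuation_with_question))  # meat
```

Adding `?` to `punctuation` would change the counts reported above, so the numbers in this notebook assume the original string.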
  


If you choose to mount your Google Drive, you only need to change the file path to match wherever the file is on your Google Drive.

The start of your file path will be `/content/drive/MyDrive/...`, where `MyDrive` is the root level of your Google Drive and `...` stands for the rest of the path to your file. Note the leading `/`. 

So if you had `tmoom.txt` saved in a folder named `texts` in your Google Drive, the file path would be:



In [None]:
# in case you want to mount your drive instead of manual upload. 
tmoom2 = open('/content/drive/MyDrive/texts/tmoom.txt').read()
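
Mounting can also be done from code rather than from the file browser icon. The sketch below assumes it is run inside Colab, where the `google.colab` package is available; elsewhere the import fails and the mount step is skipped. The `texts` folder is the same hypothetical example as above.

```python
from pathlib import PurePosixPath

try:
  # only available inside a Colab session
  from google.colab import drive
  # asks for permission, then attaches your Drive at /content/drive
  drive.mount('/content/drive')
except ImportError:
  print('Not running in Colab, so the Drive was not mounted.')

# build the path piece by piece instead of typing one long string
drive_root = PurePosixPath('/content/drive/MyDrive')
file_path = str(drive_root / 'texts' / 'tmoom.txt')
print(file_path)  # /content/drive/MyDrive/texts/tmoom.txt
```

The resulting `file_path` string can be passed to `open()` in the same way as the hand-typed path.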