# Tri-gram Frequency in Python

This notebook is intended to accompany the following article:
Sutton, S. W., Tolbert, H., and Harris, K. (in press). Data-driven collection development: Text mining college course catalogs. *Kansas Library Association College and University Libraries Section Proceedings*

A sample data file is included here (link TBA) to assist with pre-processing data in a way that will make it usable with the following code.

This code is offered publicly under a [CC-BY](https://creativecommons.org/licenses/by/2.0/deed.en) license. You are free to share and adapt it, but please give us appropriate attribution, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

To begin the process of finding which trigrams were most common for courses in the English department, I started by opening the .csv with all of our data. I then isolated the courses coded with “EG” for English and created a new spreadsheet with just those courses.


Next, I installed the Natural Language Toolkit (NLTK), a suite of Python programs and libraries for processing human language into forms a computer can work with. I then imported it along with the Pandas library of tools for data analysis and the string module, a part of Python's standard library that helps with working with letters and words.



In [None]:
# install Natural Language Tool Kit
!pip install nltk



In [None]:
import pandas as pd
import nltk
import string

Next, I took the .csv file with just the courses coded “EG” for English and loaded it into a data frame in Python. A data frame is a data structure that is effectively a spreadsheet, an arrangement of rows and columns. Unlike in Excel, creating a data frame in Colab does not display it. It exists invisibly until we direct the computer to show it.

In [None]:
# upload .csv into df
df = pd.read_csv('/content/EG 2013-2024.csv')
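
Because the data frame is not displayed automatically, a quick look at its shape and first rows can confirm that the file loaded as expected. The sketch below uses a small in-memory stand-in rather than the real file. The `Course` column and the sample rows are invented for illustration; only the `Description` column name comes from the code later in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for the course catalog .csv (invented rows)
sample_csv = io.StringIO(
    "Course,Description\n"
    "EG 101,Introduction to college writing.\n"
    "EG 207,Survey of literature for young adults.\n"
)
sample_df = pd.read_csv(sample_csv)

# shape is (rows, columns); head() returns the first rows for inspection
print(sample_df.shape)
print(sample_df.head())
```

The same two calls, `df.shape` and `df.head()`, should work on the real data frame once it has been loaded.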

After creating the data frame, I imported a list of stopwords in English from the Natural Language Toolkit and set that list as the one we would use in eliminating unnecessary words from our data.


In [None]:
# download list of stopwords

from nltk.corpus import stopwords

nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stop_words = set(stopwords.words('english'))

Then I created a translation table. A translation table is a simple lookup (a Python dictionary) that maps specific characters to other specific characters, or marks them for deletion. In this case, the table tells Python to delete every English punctuation character.

After this, with one line of code, I removed the punctuation with the translation table, removed the stop words in the stop words list and converted all characters to lowercase. The last step is important because computers read uppercase characters as completely different from lowercase characters. With everything in lowercase, the computer will not treat “the” as different from “The” or “THE” or “ThE”.



In [None]:
# Create a translation table for removing punctuation
translator = str.maketrans('', '', string.punctuation)

# Remove punctuation and then exclude stopwords, converting all to lowercase
df['Description_cleaned'] = (
    df.Description.astype(str)
    .apply(lambda x: x.translate(translator))
    .apply(lambda x: [word.lower() for word in x.split() if word.lower() not in stop_words])
)
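
To see what the translation table does, here is a minimal sketch on a single made-up sentence. It uses a tiny hand-written stop word set, which is an assumption made so the example runs without downloading the NLTK list. It also compares two tables: one that deletes punctuation, as in the cell above, and an alternative that maps each punctuation character to a space.

```python
import string

# Tiny illustrative stop word set (hypothetical; the notebook uses NLTK's English list)
demo_stop_words = {"the", "of", "and", "a", "in"}

text = "The study of nineteenth-century poetry, drama, and the novel."

# Third argument of maketrans: characters to delete outright
delete_table = str.maketrans("", "", string.punctuation)
# Alternative: map each punctuation character to a space instead
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))

deleted = [w.lower() for w in text.translate(delete_table).split()
           if w.lower() not in demo_stop_words]
spaced = [w.lower() for w in text.translate(space_table).split()
          if w.lower() not in demo_stop_words]

print(deleted)
print(spaced)
```

Deleting punctuation fuses hyphenated words ("nineteenthcentury"), while mapping to spaces splits them into two words. Either choice may be reasonable; it changes which trigrams are counted.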


Then I imported a tool called a Counter that tallies the number of times a given value is present in a data structure. A trigram is a sequence of three consecutive words. After applying the Counter to the trigrams in our data, I had a tally of the number of times each trigram appeared in the text. I then created an ordered list of the most common trigrams and looked at it.

In [None]:
# prompt: most frequent trigrams in df.Description

from collections import Counter

trigram_count = Counter(nltk.trigrams(word for row in df.Description_cleaned for word in row))
most_frequent_trigrams = trigram_count.most_common(1000)

print(most_frequent_trigrams)


[(('english', 'office', 'registration'), 319), (('office', 'registration', 'begins'), 319), (('may', 'repeated', 'credit'), 308), (('repeated', 'credit', 'different'), 308), (('credit', 'different', 'topics'), 308), (('specific', 'detailed', 'descriptions'), 297), (('detailed', 'descriptions', 'available'), 297), (('registration', 'begins', 'may'), 297), (('begins', 'may', 'repeated'), 297), (('vary', 'semester', 'semester'), 294), (('descriptions', 'available', 'department'), 289), (('available', 'department', 'english'), 289), (('department', 'english', 'office'), 289), (('semester', 'semester', 'specific'), 286), (('semester', 'specific', 'detailed'), 286), (('addressed', 'vary', 'semester'), 272), (('topics', 'addressed', 'vary'), 261), (('studies', 'specific', 'topics'), 242), (('prerequisite', 'graduate', 'standing'), 176), (('graduate', 'standing', 'permission'), 165), (('standing', 'permission', 'instructor'), 165), (('different', 'topics', 'prerequisite'), 154), (('topics', 'p
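
The cell above feeds all descriptions to `nltk.trigrams` as one continuous stream of words, so some trigrams span the boundary between two course descriptions. The sketch below shows the difference on two made-up descriptions. It swaps in a small `zip`-based helper in place of `nltk.trigrams` so that it runs with the standard library alone; the helper name and the sample data are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter

def trigrams(words):
    """Return each run of three consecutive words (a stand-in for nltk.trigrams)."""
    return zip(words, words[1:], words[2:])

# Made-up cleaned descriptions (lists of lowercase words)
cleaned = [
    ["young", "adult", "literature", "survey"],
    ["young", "adult", "literature", "seminar"],
]

# Counting within each description, so no trigram crosses a boundary
per_row = Counter(t for row in cleaned for t in trigrams(row))

# Counting over one flattened stream, as in the cell above
flat = [word for row in cleaned for word in row]
flattened = Counter(trigrams(flat))

print(per_row.most_common(3))
print(flattened[("literature", "survey", "young")])
```

In the flattened count, `('literature', 'survey', 'young')` appears even though those words never occur together in a single description. Whether that matters depends on how much boundary noise the later filtering step removes.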

The list was initially limited to the 10 most common trigrams, which struck me as far too few to be useful, so I increased it to 1,000. None of the first few dozen trigrams were useful: most came from boilerplate about registration, repeatable credit, and prerequisites. Clearly we needed more stop words than the NLTK list provides. After discussing this with the other authors, we decided to filter the data through a second, subject-specific stop words list.

Context matters when working with any data, and it mattered here. A trigram that contains an obvious element carries only two words of real information. For example, the presence of “English” in a trigram drawn from English course descriptions tells me nothing I did not already know from the context of the data. Such a trigram is effectively a bigram and does not serve our purpose of informing collection development decisions.

So my next step was to create a data frame from this list of trigrams. The data frame holds each trigram and its number of occurrences, and the row index gives the trigram's rank in the list, counting from 0.


In [None]:
# prompt: most frequent trigrams in df.Description in new df

df_most_frequent_trigrams = pd.DataFrame(most_frequent_trigrams)
df_most_frequent_trigrams.columns = ['Trigram', 'Frequency']
print(df_most_frequent_trigrams)


                             Trigram  Frequency
0    (english, office, registration)        319
1     (office, registration, begins)        319
2            (may, repeated, credit)        308
3      (repeated, credit, different)        308
4        (credit, different, topics)        308
..                               ...        ...
995   (topics, language, literature)         11
996     (language, literature, vary)         11
997     (literature, vary, offering)         11
998       (vary, offering, offering)         11
999   (offering, offering, specific)         11

[1000 rows x 2 columns]
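
In the output above, the rank is implied by the row index, which starts at 0. If an explicit, 1-based rank column is wanted, something like the following sketch could be used. The sample pairs and the `Rank` column name are assumptions for illustration.

```python
import pandas as pd

# Made-up (trigram, count) pairs in the shape returned by Counter.most_common()
pairs = [
    (("young", "adult", "literature"), 22),
    (("creative", "writing", "literary"), 24),
    (("sonnets", "epic", "poems"), 22),
]

ranked = pd.DataFrame(pairs, columns=["Trigram", "Frequency"])
# A stable sort keeps tied trigrams in their original relative order
ranked = ranked.sort_values("Frequency", ascending=False, kind="stable").reset_index(drop=True)

# 1-based position in the sorted list; tied counts get consecutive ranks
ranked["Rank"] = ranked.index + 1
print(ranked)
```

The list returned by `most_common()` is already sorted, so on the real data the sort step would likely be unnecessary; it is included here only because the sample pairs are out of order.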


Next I created a new list of words to remove, called “unwanted_words” to differentiate it from the initial stop words list. Since my undergraduate degree was in English, I felt confident in judging whether or not a given trigram was pertinent. I went down the list of trigrams, adding to the unwanted words list any word that I felt was not useful.

By the time the unwanted words list reached 35 entries, the remaining trigrams looked useful. The list could be refined further, but I stopped there, with 545 trigrams remaining.


In [None]:
# create list of unwanted words
# ('provide' is listed twice, which is harmless in a set; 'prerequisities' is kept as
# spelled in the original run, so it will not match 'prerequisite' or 'prerequisites')
unwanted_words = {
    'eng', 'flint', 'hills', 'technical', 'english', 'office', 'fundamental',
    'registration', 'may', 'credit', 'specific', 'semester', 'available',
    'detailed', 'graduate', 'standing', 'topics', 'vary', 'eg', 'studies',
    'course', 'designed', 'provide', 'hours', 'coursework', 'provide',
    'prerequisities', 'completion', '24', 'grade', 'general', 'requirements',
    '104', '102', 'eg104',
}


# Filter out rows where 'Trigram' contains any unwanted words
df_filtered_trigrams = df_most_frequent_trigrams[
    df_most_frequent_trigrams['Trigram'].apply(lambda trigram: not any(word in unwanted_words for word in trigram))
]

print(df_filtered_trigrams)

                                 Trigram  Frequency
48         (creative, writing, literary)         24
55   (superior, precollege, preparation)         22
67                (sonnets, epic, poems)         22
93            (young, adult, literature)         22
98             (author, author, studied)         22
..                                   ...        ...
986        (students, tools, background)         11
987       (tools, background, necessary)         11
988   (background, necessary, undertake)         11
989    (necessary, undertake, scholarly)         11
990     (undertake, scholarly, research)         11

[545 rows x 2 columns]
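
The filter keeps a trigram only when none of its three words is in the unwanted set. Here is a minimal sketch of that logic on a few made-up trigrams, using plain Python lists rather than a data frame. The sample trigrams and the shortened word set are assumptions for illustration.

```python
# Made-up trigrams and a shortened unwanted-word set for illustration
demo_trigrams = [
    ("department", "english", "office"),
    ("young", "adult", "literature"),
    ("selected", "topic", "poetry"),
]
demo_unwanted = {"english", "office", "topics"}

# Keep a trigram only if it shares no word with the unwanted set;
# isdisjoint() is equivalent to the `not any(...)` test used above
kept = [t for t in demo_trigrams if demo_unwanted.isdisjoint(t)]
print(kept)
```

Matching is exact, so "topics" in the unwanted set does not remove a trigram containing "topic". Singular and plural forms, and any misspellings, need their own entries.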


In the final step of my analysis in Python, I created a new .csv of the data frame with the list of 545 trigrams.

In [None]:
# create .csv of Data Frame
df_filtered_trigrams.to_csv('EG_cleaned_trigrams_All.csv', index=False)
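
One thing to keep in mind when reusing the exported file: a .csv stores only text, so each trigram tuple is written as its text representation. The sketch below, on a single made-up row and an in-memory buffer instead of a file on disk, shows one way to turn that text back into a tuple with `ast.literal_eval` from the standard library.

```python
import ast
import io
import pandas as pd

# One made-up row in the same shape as the filtered trigram data frame
demo = pd.DataFrame({
    "Trigram": [("young", "adult", "literature")],
    "Frequency": [22],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# The tuple comes back as a string such as "('young', 'adult', 'literature')"
print(type(reloaded.loc[0, "Trigram"]))

# literal_eval safely parses that text back into a Python tuple
reloaded["Trigram"] = reloaded["Trigram"].apply(ast.literal_eval)
print(reloaded.loc[0, "Trigram"])
```

If the file is only going to be opened in a spreadsheet program, this conversion is not needed.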