This repository contains multiple programs intended to analyze the statements released by select colleges on SCOTUS's ruling on affirmative action.
- 'statement_to_csv.py' uses the Google Sheets API to read data from a column and write its contents to individual CSV files, stored in the 'csv_files' folder
- 'region_finder.py' tags specified colleges with region and campus size using data from the 'data_directory' folder and writes the results to the CSV file 'locations_results.csv'
- The 'tfidf' directory contains programs that perform term frequency-inverse document frequency (TF-IDF) analysis on our corpus, while the 'sentiment' directory contains programs that perform sentiment analysis
- The 'ngram' directory contains code for n-gram analysis of the corpus of text files. It defines functions to preprocess and tokenize the text, find the top n-grams, and compare n-grams
- The 'response_comparison' directory contains programs that compare the similarity between different responses, as well as between the responses and a GPT-generated response, using Jaccard similarity and cosine similarity
- 'word_analysis.py' calculates average word count, lexical diversity, and most frequent words for each response in the corpus and writes the results to 'word_analysis_results.csv'
- 'statement_to_csv' depends on a 'credentials.json' file, which is not included in this repository for security reasons. This code does not need to be run, as the results are already stored in 'csv_files'
- 'region_finder' can be run from the home directory
- 'tfidf_analysis' needs to be run from the 'tfidf' directory, and 'vader_sentiment' needs to be run from the 'sentiment' directory
- This repository uses the 'pandas', 'os', 'vaderSentiment', 'sklearn', 'numpy', 'altair', 'csv', 'nltk', and 'googleapiclient' packages.
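
For readers unfamiliar with the two measures named for the 'response_comparison' directory, here is a minimal sketch of Jaccard similarity and cosine similarity between two texts. It is not the code in this repository: it uses only the Python standard library, and the function names ('tokenize', 'jaccard_similarity', 'cosine_similarity'), the simple regex tokenizer, and the two example sentences are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def jaccard_similarity(a, b):
    """Shared vocabulary size divided by combined vocabulary size."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0  # two empty texts: define similarity as 0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two raw word-count vectors."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no tokens
    return dot / (norm_a * norm_b)


# Made-up example sentences, not taken from the corpus.
statement_a = "We remain committed to diversity"
statement_b = "We remain committed to access"

jaccard = jaccard_similarity(statement_a, statement_b)
cosine = cosine_similarity(statement_a, statement_b)
print(f"Jaccard: {jaccard:.3f}, cosine: {cosine:.3f}")
```

Jaccard similarity looks only at which words occur, while cosine similarity also accounts for how often each word occurs, so the two scores can differ for the same pair of texts. Weighting the counts (for example with TF-IDF) before taking the cosine is a common variation.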