Digital Ethnic Futures Lab - SCOTUS College Statement Text Analysis

Description

This repository contains multiple programs that analyze the statements released by select colleges in response to the SCOTUS ruling on affirmative action. Illustrative sketches of several of these techniques follow the list below.

  • 'statement_to_csv.py' uses the Google Sheets API to read data from a spreadsheet column and writes its contents out as individual CSV files, stored in the 'csv_files' folder
  • 'region_finder.py' tags region information and campus size for specified colleges using data from the 'data_directory' folder and writes the results to 'locations_results.csv'
  • The 'tfidf' directory contains programs that perform term frequency-inverse document frequency (TF-IDF) analysis on our corpus, while the 'sentiment' directory contains programs that perform sentiment analysis
  • The 'ngram' directory performs n-gram analysis on the corpus of text files: it defines functions to preprocess and tokenize each text, find the top n-grams, and compare n-grams across texts
  • The 'response_comparison' directory contains programs that compare the similarity between different responses, as well as between the responses and a GPT-generated response, using Jaccard similarity and cosine similarity
  • 'word_analysis.py' calculates the average word count, lexical diversity, and most frequent words for each response in the corpus and writes the results to 'word_analysis_results.csv'; 'word_analysis_plot.py' plots those results
  • 'word_phrase.py' finds the percentage of texts in the corpus that contain certain words or phrases
  • 'identify_category.py' categorizes college responses according to specific lexicons
  • 'jbdelta_average.py' tokenizes the responses, calculates word frequency statistics, computes each text's deviation from the corpus average using z-scores, and visualizes these deviations with a bar chart
  • 'jbdelta_reference.py' is similar to 'jbdelta_average.py', but instead calculates the deviation between a single test text and the rest of the corpus
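
As a rough illustration of what 'statement_to_csv.py' does, the sketch below reads one spreadsheet column with the Google Sheets API. The spreadsheet ID, the range, and the service-account authentication are placeholders and assumptions; the actual script may authenticate differently.

```python
# Hedged sketch: read one column via the Google Sheets API.
# SPREADSHEET_ID and the range are placeholders; the repository's
# actual authentication flow may differ from service-account auth.
from google.oauth2.service_account import Credentials
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]
creds = Credentials.from_service_account_file("credentials.json", scopes=SCOPES)
service = build("sheets", "v4", credentials=creds)

# Fetch every cell in column A of the first sheet
result = (
    service.spreadsheets()
    .values()
    .get(spreadsheetId="SPREADSHEET_ID", range="Sheet1!A:A")
    .execute()
)
rows = result.get("values", [])  # list of single-cell lists, e.g. [["text"], ...]
```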
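The TF-IDF programs rely on scikit-learn; here is a minimal sketch of the core step, with a two-document toy corpus standing in for the real statements.

```python
# Minimal TF-IDF sketch with scikit-learn; the toy corpus is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "We remain committed to a diverse student body.",
    "The university will comply with the Court's decision.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)  # rows are documents, columns are terms

# Highest-weighted terms in the first statement
terms = vectorizer.get_feature_names_out()
weights = tfidf[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)[:5])
```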
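The sentiment programs use VADER via the 'vaderSentiment' package; the core call is simple enough to show directly, with an invented example sentence.

```python
# VADER sentiment scoring; the sentence is an invented example.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("We are deeply disappointed by this ruling.")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```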
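A sketch of the n-gram step, assuming NLTK tokenization; the repository's exact preprocessing may differ.

```python
# N-gram counting sketch; lowercasing and the alphabetic filter are assumptions.
from collections import Counter
from nltk import ngrams
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

text = "Affirmative action has shaped college admissions for decades."
tokens = [w.lower() for w in word_tokenize(text) if w.isalpha()]

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```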
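Both similarity measures used in 'response_comparison' can be sketched in a few lines; the two sentences here are invented stand-ins for actual responses.

```python
# Jaccard (set overlap) and cosine (count-vector angle) similarity sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "We value diversity in our community."
b = "Our community values diversity and inclusion."

# Jaccard: intersection over union of the word sets
set_a, set_b = set(a.lower().split()), set(b.lower().split())
jaccard = len(set_a & set_b) / len(set_a | set_b)

# Cosine: angle between the documents' word-count vectors
vectors = CountVectorizer().fit_transform([a, b])
cosine = cosine_similarity(vectors[0], vectors[1])[0, 0]

print(f"Jaccard: {jaccard:.2f}, cosine: {cosine:.2f}")
```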
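A toy version of the per-response statistics computed by 'word_analysis.py'; the tokenization choices and the helper name 'word_stats' are assumptions.

```python
# Per-text word statistics sketch; 'word_stats' is a hypothetical helper.
from collections import Counter

def word_stats(text):
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w.isalpha()]
    return {
        "word_count": len(words),
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
        "most_frequent": Counter(words).most_common(3),
    }

print(word_stats("Diversity matters, and diversity endures at our university."))
```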
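The z-score deviation idea behind the 'jbdelta' scripts can be sketched as follows; the toy corpus and the mean-absolute-z summary are assumptions about the exact statistic used.

```python
# Z-score deviation of each text from the corpus average (Burrows-style delta).
# The toy corpus stands in for the college statements.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "We affirm our commitment to diversity.",
    "The ruling changes our admissions process.",
    "Diversity remains central to our mission.",
]

counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)
rel_freq = counts / counts.sum(axis=1, keepdims=True)  # per-text word frequencies

mean, std = rel_freq.mean(axis=0), rel_freq.std(axis=0)
z = (rel_freq - mean) / np.where(std == 0, 1.0, std)  # z-score per word per text

deviation = np.abs(z).mean(axis=1)  # one deviation score per text
print(deviation)
```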

Getting Started

  • 'statement_to_csv.py' depends on a 'credentials.json' file, which is not included in this repository for security reasons. The script does not need to be run, as its results are already stored in 'csv_files'

  • 'region_finder.py' can be run from the home directory

  • 'tfidf_analysis' must be run from the 'tfidf' directory, and 'vader_sentiment' must be run from the 'sentiment' directory

Dependencies

  • This repository depends on the 'pandas', 'vaderSentiment', 'scikit-learn' ('sklearn'), 'numpy', 'altair', 'nltk', and 'googleapiclient' packages, along with the Python standard-library 'os' and 'csv' modules.