Instructions from Prof:

Alrighty y’all, with about a month left in the semester (how? HOW??), please read through this document for the next steps I’m envisioning:
https://gtvault-my.sharepoint.com/:w:/g/personal/aduncan9_gatech_edu/ERDbeQND2pdJsK77-9HTWqwBkCYJ4zXoB-66jYMcCF1LmA?e=7b0Nim
And thanks for the suggestions on coding topics! I revised the topics a bit, trying to take into account what’s interesting, what are somewhat "hot topics", and what topics should have enough papers to actually find meaningful trends. I restricted it to 4 topics because [see previous comment about there only being a month left in the semester].

What this notebook does:
1. Read in the list of L@S papers provided by the prof. This list does not include abstracts (this is "prof_df" in this notebook).
2. Load the L@S papers exported manually from Scopus. This export includes abstracts (this is "df" in this notebook).
3. Use fuzzywuzzy to match prof_df against df by paper title, so that every paper in prof_df with a Scopus match gets its abstract.
4. Export the merged table to CSV.


In [1]:
import pandas as pd

In [74]:
# Downloaded from Scopus
# https://www.scopus.com/results/results.uri?sort=plf-f&src=s&st1=Developing+Student%27s+Global+Competencies+at+Scale+in+an+Affordable+MOOC+K12+Outreach+Initiative&sid=80ba9e19ec822c92b36edfab5890b776&sot=b&sdt=b&sl=105&s=%28KEY%28l%40s+2016%29+OR+SRCTITLE%28l%40s+2017%29+OR+SRCTITLE%28l%40s+2022%29%29&origin=searchbasic&editSaveSearch=&yearFrom=Before+1960&yearTo=Present&sessionSearchId=80ba9e19ec822c92b36edfab5890b776&limit=10
# By searching "source title l@s OR source title learning at scale"
df = pd.read_csv('input/all_papers.csv')
# df = df.dropna(subset=['Author full names']) # there are empty results

In [75]:
df['Source title'].unique()

array(['L@S 2022 - Proceedings of the 9th ACM Conference on Learning @ Scale',
       'L@S 2023 - Proceedings of the 10th ACM Conference on Learning @ Scale',
       'Improving Student Learning at Scale: a How-To Guide for Higher Education',
       'L@S 2021 - Proceedings of the 8th ACM Conference on Learning @ Scale',
       'L@S 2020 - Proceedings of the 7th ACM Conference on Learning @ Scale',
       'Proceedings of the 6th 2019 ACM Conference on Learning at Scale, L@S 2019',
       'Proceedings of the 5th Annual ACM Conference on Learning at Scale, L at S 2018',
       'L@S 2017 - Proceedings of the 4th (2017) ACM Conference on Learning at Scale',
       'L@S 2016 - Proceedings of the 3rd 2016 ACM Conference on Learning at Scale',
       'L@S 2015 - 2nd ACM Conference on Learning at Scale',
       'L@S 2014 - Proceedings of the 1st ACM Conference on Learning at Scale'],
      dtype=object)
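
Note that one of the source titles above ('Improving Student Learning at Scale: a How-To Guide for Higher Education') is a book rather than an L@S proceedings volume. A minimal sketch of how such rows could be filtered out, using a toy DataFrame rather than the real export. The assumption (not something this notebook does) is that every proceedings title contains the phrase "ACM Conference on Learning"; the names `sources` and `proceedings_only` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the Scopus export; the paper titles are placeholders
sources = pd.DataFrame({
    'Title': ['Paper A', 'Book chapter B', 'Paper C'],
    'Source title': [
        'L@S 2022 - Proceedings of the 9th ACM Conference on Learning @ Scale',
        'Improving Student Learning at Scale: a How-To Guide for Higher Education',
        'Proceedings of the 5th Annual ACM Conference on Learning at Scale, L at S 2018',
    ],
})

# Assumption: all proceedings titles share this literal phrase
is_proceedings = sources['Source title'].str.contains(
    'ACM Conference on Learning', regex=False
)

# Keep only the proceedings rows and renumber the index
proceedings_only = sources[is_proceedings].reset_index(drop=True)
print(proceedings_only['Title'].tolist())
```

Since matching later only looks up titles that appear in prof_df, leaving the extra rows in is probably harmless; the filter just makes the candidate pool cleaner.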

In [76]:
df.columns

Index(['Authors', 'Author full names', 'Author(s) ID', 'Title', 'Year',
       'Source title', 'Volume', 'Issue', 'Art. No.', 'Page start', 'Page end',
       'Page count', 'Cited by', 'DOI', 'Link', 'Affiliations',
       'Authors with affiliations', 'Abstract', 'Author Keywords',
       'Index Keywords', 'Molecular Sequence Numbers', 'Chemicals/CAS',
       'Tradenames', 'Manufacturers', 'Funding Details', 'Funding Texts',
       'References', 'Correspondence Address', 'Editors', 'Publisher',
       'Sponsors', 'Conference name', 'Conference date', 'Conference location',
       'Conference code', 'ISSN', 'ISBN', 'CODEN', 'PubMed ID',
       'Language of Original Document', 'Abbreviated Source Title',
       'Document Type', 'Publication Stage', 'Open Access', 'Source', 'EID'],
      dtype='object')

In [77]:
df['Abstract']

0      Embedding a post recommendation system in onli...
1      The popularity of Massive Open Online Courses ...
2      The emerging field of affordable degrees at sc...
3      The United States is experiencing a shortage o...
4      In past work, time management interventions in...
                             ...                        
673    This paper discusses learning at scale from th...
674    Rapid feedback is a core component of mastery ...
675    An efficient peer grading mechanism is propose...
676    For an instructor who is teaching a massive op...
677                              [No abstract available]
Name: Abstract, Length: 678, dtype: object
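
The last row shows that Scopus uses the literal string "[No abstract available]" instead of a missing value. A small sketch of how those placeholders could be counted and turned into real missing values, on toy data. The names `abstracts`, `PLACEHOLDER`, and `cleaned` are made up for illustration, and this step is not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df['Abstract']
abstracts = pd.Series([
    'Some abstract text.',
    '[No abstract available]',
    'Another abstract.',
])

PLACEHOLDER = '[No abstract available]'

# Count how many rows carry the placeholder instead of an abstract
is_placeholder = abstracts == PLACEHOLDER
n_missing = int(is_placeholder.sum())

# Replace the placeholder with NaN so isna()/dropna() treat it as missing
cleaned = abstracts.mask(is_placeholder)
print(n_missing, int(cleaned.isna().sum()))
```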

In [78]:
# Prof's document
# Extracted from https://gtvault-my.sharepoint.com/personal/aduncan9_gatech_edu/_layouts/15/onedrive.aspx?ga=1&id=%2Fpersonal%2Faduncan9%5Fgatech%5Fedu%2FDocuments%2FConferences%20and%20Journals%2FLearning%20at%20Scale%202024%2FTeam%20documents%2FPapers
# The document is named "Paper list.xlsx"; it is read here as input/prof_doc.csv
prof_df = pd.read_csv('input/prof_doc.csv')
prof_df

Unnamed: 0,Paper Title,Year,Full Paper or WIP?,Session Name,"Context: MOOCs, In-person degrees, Online degrees, K-12, Informal learning environments, Other (specify in 1-2 words in Context Notes)",Context Notes,"Subject: CS, Non-CS STEM, Non-STEM (specify in 1-2 words in Subject Notes), Unspecified",Subject Notes,"AI/ML Applications in Education? (Enter ""Y"" or ""N"")"
0,Student skill and goal achievement in the mapp...,2014,Full Paper,Student Skills and behavior,,,,,
1,Correlating Skill and Improvement in 2 MOOCs w...,2014,Full Paper,Student Skills and behavior,,,,,
2,Demographic differences in how students naviga...,2014,Full Paper,Student Skills and behavior,,,,,
3,Understanding in-video dropouts and interactio...,2014,Full Paper,Course materials,,,,,
4,How video production affects student engagemen...,2014,Full Paper,Course materials,,,,,
...,...,...,...,...,...,...,...,...,...
557,Towards Game-based Assessment at Scale,2023,Full Paper,Works-in-Progress,,,,,
558,Towards Scalable Vocabulary Acquisition Assess...,2023,Full Paper,Works-in-Progress,,,,,
559,Towards the Identification of Experts in Infor...,2023,Full Paper,Works-in-Progress,,,,,
560,Unlocking Financial Success: Empowering Higher...,2023,Full Paper,Works-in-Progress,,,,,


In [93]:
# Match the titles in prof_df and df using fuzzywuzzy
from fuzzywuzzy import fuzz

# For each paper in prof_df (column "Paper Title"), find the most similar title in df["Title"].
# The best score is stored in the column "Similarity Score".
# The matched title is stored in the column "Matched Title" only if the score reaches the threshold.
MATCH_THRESHOLD = 70

# Lowercase the Scopus titles once instead of on every comparison
scopus_titles = df['Title'].tolist()
scopus_titles_lower = [title.lower() for title in scopus_titles]

scores, matches = [], []
for paper_title in prof_df['Paper Title']:
    query = paper_title.lower()
    max_score, matched_title = 0, None

    for title, title_lower in zip(scopus_titles, scopus_titles_lower):
        score = fuzz.ratio(query, title_lower)
        if score > max_score:
            max_score, matched_title = score, title

    scores.append(max_score)
    # A score of at least 70 counts as a match; otherwise leave the title empty
    matches.append(matched_title if max_score >= MATCH_THRESHOLD else None)

# Assign whole columns at once, which avoids row-by-row .loc writes
prof_df['Similarity Score'] = scores
prof_df['Matched Title'] = matches

prof_df


Unnamed: 0,Paper Title,Year,Full Paper or WIP?,Session Name,"Context: MOOCs, In-person degrees, Online degrees, K-12, Informal learning environments, Other (specify in 1-2 words in Context Notes)",Context Notes,"Subject: CS, Non-CS STEM, Non-STEM (specify in 1-2 words in Subject Notes), Unspecified",Subject Notes,"AI/ML Applications in Education? (Enter ""Y"" or ""N"")",Similarity Score,Matched Title
0,Student skill and goal achievement in the mapp...,2014,Full Paper,Student Skills and behavior,,,,,,100,Student skill and goal achievement in the mapp...
1,Correlating Skill and Improvement in 2 MOOCs w...,2014,Full Paper,Student Skills and behavior,,,,,,99,Correlating skill and improvement in 2 MOOCs w...
2,Demographic differences in how students naviga...,2014,Full Paper,Student Skills and behavior,,,,,,100,Demographic differences in how students naviga...
3,Understanding in-video dropouts and interactio...,2014,Full Paper,Course materials,,,,,,99,Understanding in-video dropouts and interactio...
4,How video production affects student engagemen...,2014,Full Paper,Course materials,,,,,,100,How video production affects student engagemen...
...,...,...,...,...,...,...,...,...,...,...,...
557,Towards Game-based Assessment at Scale,2023,Full Paper,Works-in-Progress,,,,,,100,Towards Game-based Assessment at Scale
558,Towards Scalable Vocabulary Acquisition Assess...,2023,Full Paper,Works-in-Progress,,,,,,100,Towards Scalable Vocabulary Acquisition Assess...
559,Towards the Identification of Experts in Infor...,2023,Full Paper,Works-in-Progress,,,,,,100,Towards the Identification of Experts in Infor...
560,Unlocking Financial Success: Empowering Higher...,2023,Full Paper,Works-in-Progress,,,,,,100,Unlocking Financial Success: Empowering Higher...
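
To make the scoring and the threshold concrete, here is a rough standard-library sketch of the same idea: score every candidate, keep the best, and accept it only at or above the cutoff. It uses `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so scores may differ slightly from fuzzywuzzy's depending on how fuzzywuzzy is installed. The helper names `similarity` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Case-insensitive 0-100 similarity (stdlib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(query, candidates, threshold=70):
    """Return (matched title or None, best score) for one query title."""
    scored = [(similarity(query, candidate), candidate) for candidate in candidates]
    score, title = max(scored, key=lambda pair: pair[0])
    return (title if score >= threshold else None), score


candidates = [
    'Towards Game-based Assessment at Scale',
    'Teaching at Scale and Back Again',
]

# Same title with different capitalization: perfect score, accepted
print(best_match('Towards game-based assessment at scale', candidates))

# A short title with no counterpart stays below the threshold, so no match
print(best_match('Examinator v3', candidates))
```

The threshold trades off missed matches against wrong ones: titles that differ only in capitalization or punctuation score near 100, while a paper missing from the candidate list should fall well below 70.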


In [94]:
# Merge prof_df and df on "Matched Title" (a left join keeps every prof_df row)
merged_df = pd.merge(prof_df, df, how='left', left_on='Matched Title', right_on='Title')
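
One thing to watch with this kind of merge: if the same title appears more than once on the right-hand side, the left join duplicates the corresponding left rows. A hedged sketch on toy data of one way to guard against that, by dropping duplicate titles first and asking pandas to validate the relationship. The frames `left` and `right` are hypothetical stand-ins for prof_df and df, and keeping the first duplicate is an arbitrary choice.

```python
import pandas as pd

# Toy stand-ins: 'C' has no matched title, and 'a' appears twice on the right
left = pd.DataFrame({
    'Paper Title': ['A', 'B', 'C'],
    'Matched Title': ['a', 'b', None],
})
right = pd.DataFrame({
    'Title': ['a', 'a', 'b'],
    'Abstract': ['first a', 'second a', 'only b'],
})

# Keep one row per title so each left row matches at most one right row
right_unique = right.drop_duplicates(subset='Title', keep='first')

merged = pd.merge(
    left, right_unique, how='left',
    left_on='Matched Title', right_on='Title',
    validate='many_to_one',  # raises MergeError if right titles are not unique
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Papers that found no counterpart on the right-hand side
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Paper Title'].tolist()
print(len(merged), unmatched)
```

With the duplicates removed, the row count after the merge equals the row count before it, which makes the final length check an exact equality.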

In [95]:
cols = merged_df.columns

# Remove "Abstract" from cols
cols = cols.drop('Abstract')

# Insert "Abstract" as the fifth column (position 4, zero-based)
cols = cols.insert(4, 'Abstract')

cols


Index(['Paper Title', 'Year_x', 'Full Paper or WIP?', 'Session Name',
       'Abstract',
       'Context: MOOCs, In-person degrees, Online degrees, K-12, Informal learning environments, Other (specify in 1-2 words in Context Notes)',
       'Context Notes',
       'Subject: CS, Non-CS STEM, Non-STEM (specify in 1-2 words in Subject Notes), Unspecified',
       'Subject Notes', 'AI/ML Applications in Education? (Enter "Y" or "N")',
       'Similarity Score', 'Matched Title', 'Authors', 'Author full names',
       'Author(s) ID', 'Title', 'Year_y', 'Source title', 'Volume', 'Issue',
       'Art. No.', 'Page start', 'Page end', 'Page count', 'Cited by', 'DOI',
       'Link', 'Affiliations', 'Authors with affiliations', 'Author Keywords',
       'Index Keywords', 'Molecular Sequence Numbers', 'Chemicals/CAS',
       'Tradenames', 'Manufacturers', 'Funding Details', 'Funding Texts',
       'References', 'Correspondence Address', 'Editors', 'Publisher',
       'Sponsors', 'Conference name', 

In [96]:
# Export merged_df (prof_df plus the matched Scopus columns) to CSV
merged_df[cols].to_csv('output/prof_df_with_additional_info.csv', index=False)

In [98]:
# Print all the matched titles as a sanity check
print(merged_df.sort_values('Similarity Score', ascending=True)[['Paper Title','Title','Similarity Score']].to_string())

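# Rows scoring below the threshold of 70 (for example "Examinator v3", score 52) have no match, so their Title is NaN.
# Among the matched rows, even the lowest similarity scores have the correct title matched.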

                                                                                                                                                            Paper Title                                                                                                                                                               Title  Similarity Score
542                                                                                                                                                       Examinator v3                                                                                                                                                                 NaN                52
391                                                                                                                                    Teaching at Scale and Back Again                                                                                                                                     
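
Instead of reading the whole printout, the sanity check could be narrowed to the rows that actually need eyes on them. A small sketch on toy data, assuming a hypothetical review cutoff of 90 (the notebook itself only defines the match threshold of 70). The frame `results` stands in for the three columns printed above.

```python
import pandas as pd

# Toy stand-in for merged_df[['Paper Title', 'Title', 'Similarity Score']]
results = pd.DataFrame({
    'Paper Title': ['P1', 'P2', 'P3', 'P4'],
    'Title': [None, 'T2', 'T3', 'T4'],
    'Similarity Score': [52, 71, 88, 100],
})

THRESHOLD = 70     # same cutoff as the matching step
REVIEW_BELOW = 90  # hypothetical: matches under this score get a manual look

# Rows that never reached the threshold, so they have no abstract
unmatched = results[results['Similarity Score'] < THRESHOLD]

# Accepted matches that are close enough to the cutoff to deserve a check
borderline = results[
    results['Similarity Score'].between(THRESHOLD, REVIEW_BELOW - 1)
]

print(unmatched['Paper Title'].tolist())
print(borderline['Paper Title'].tolist())
```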

In [99]:
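# A left merge keeps every prof_df row, so merged_df can never be shorter than prof_df.
# It will be longer if a matched title appears more than once in df.
assert len(merged_df) >= len(prof_df)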