# Create DFs

* The Newsela data comes from https://github.com/PaulOpu/TS_LossFunction, a repository by colleagues who did another project with Newsela data.
* I restricted the topic to animal-related articles.
* Because the first part of my project was to reproduce Stajner and Hulpus (2015), I chose 1000 articles (200 original articles, each with its 4 simplified versions)
* The data in the 'britannica' folder was collected from https://kids.britannica.com/
* The outputs of this notebook are the CSV files 'csv/newsela.csv' and 'csv/britannica.csv'
* The calculations in the following notebooks require a 'path' column and a 'level' column.
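As a quick sanity check on the column requirement above, a later notebook could verify that a loaded CSV really has the 'path' and 'level' columns before using it. The following is only a minimal sketch: `check_required_columns` is a hypothetical helper, not part of this project, and the in-memory CSV stands in for 'csv/newsela.csv' (its two rows are copied from the sample output further down in this notebook).

```python
import io

import pandas as pd

REQUIRED_COLUMNS = {"path", "level"}  # columns the later notebooks rely on


def check_required_columns(df):
    """Return the sorted list of required columns missing from df (hypothetical helper)."""
    return sorted(REQUIRED_COLUMNS - set(df.columns))


# Stand-in for 'csv/newsela.csv'; the first CSV column is the saved index.
sample_csv = io.StringIO(
    ",path,name,score,level\n"
    "0,Narwhal-whales-sounds-44418-1400.txt,Narwhal-whales-sounds,1400,4\n"
    "1,Narwhal-whales-sounds-44421-1220.txt,Narwhal-whales-sounds,1220,3\n"
)
df = pd.read_csv(sample_csv, index_col=0)

missing = check_required_columns(df)
print(missing)  # an empty list means both columns are present
```

Passing `index_col=0` when reading keeps the saved index from showing up as an extra unnamed column.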

#### Import requirements

In [1]:
import numpy as np
from os import listdir
import pandas as pd
from tqdm import tqdm

#### Create DF from directory containing text files

In [2]:
def create_df(folder, newsela=True):
    '''
    Create the DataFrame needed for later calculations.
    folder is either 'britannica' or 'newsela'.
    Set newsela=False for the Britannica data.
    '''
    #create list of text file names
    files = listdir(folder)
    
    #loop over text files and save their path, name and score
    #Newsela file names look like '<name>-<id>-<score>.txt'
    files_with_score = []
    if newsela:
        for file in files:
            split_file_name = file[:-4].split("-")
            file_name = "-".join(split_file_name[:-2])
            #file_id = split_file_name[-2]
            level = split_file_name[-1]
            files_with_score += [(file, file_name, level)]

    else:
        #Britannica file names look like '<name>_<audience>' with no extension
        for file in files:
            split_file_name = file.split("_")
            file_name = "-".join(split_file_name[:-1])
            level = split_file_name[-1]
            files_with_score += [(file, file_name, level)]
            
    print("Sample row:", files_with_score[0])
        
    
    #save the values from above into a pandas dataframe
    lines = []
    for path, name, score in tqdm(files_with_score):
        lines.append([path, name, score])

    df = pd.DataFrame(data=lines, columns=["path", "name", "score"])
    
    # sort the dataframe by name (ascending) and then by score (descending)
    if newsela:
        df["score"]=pd.to_numeric(df["score"])
    
    df.sort_values(["name", "score"], ascending=[True, False], inplace=True)
    df.reset_index(drop=True, inplace=True)
    
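    if newsela:
        #add a column for level; rows are sorted by descending score within each article,
        #so the highest score (the original text) gets level 4 and the lowest score
        #(the simplest version) gets level 0
        #this assumes every article has exactly 5 versions
        df["level"]=abs(df.index%5-4)
    
    else:
        #map the audience label to a level: 0 is the most advanced text, 2 the simplest
        #(note that this runs in the opposite direction to the Newsela levels above)
        dic={'kids':2, 'students':1, 'scholars':0}
        df['level']=df["score"].map(dic)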
    
    df.to_csv("csv/"+folder+".csv")
    
    return df
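The file-name parsing inside `create_df` can be tried in isolation. The sketch below pulls the two splitting rules into small functions; the function names are hypothetical and exist only for illustration, while the splitting logic and the example file names are taken from the code and outputs in this notebook.

```python
def parse_newsela_name(file_name):
    """Split '<name>-<id>-<score>.txt' into (name, score)."""
    parts = file_name[:-4].split("-")  # drop the '.txt' extension first
    return "-".join(parts[:-2]), parts[-1]


def parse_britannica_name(file_name):
    """Split '<name>_<audience>' into (name, audience)."""
    parts = file_name.split("_")
    # Underscores inside the name are replaced by hyphens, as in create_df.
    return "-".join(parts[:-1]), parts[-1]


print(parse_newsela_name("shark-endangered-224-730.txt"))  # ('shark-endangered', '730')
print(parse_britannica_name("african_penguin_kids"))       # ('african-penguin', 'kids')
```

Both rules take the score or audience from the last segment, so names that themselves contain hyphens or underscores are still handled.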

In [3]:
create_df('newsela', newsela=True).head()

100%|██████████| 1000/1000 [00:00<00:00, 80197.02it/s]

Sample row: ('shark-endangered-224-730.txt', 'shark-endangered', '730')

| | path | name | score | level |
|---|---|---|---|---|
| 0 | Narwhal-whales-sounds-44418-1400.txt | Narwhal-whales-sounds | 1400 | 4 |
| 1 | Narwhal-whales-sounds-44421-1220.txt | Narwhal-whales-sounds | 1220 | 3 |
| 2 | Narwhal-whales-sounds-44422-1050.txt | Narwhal-whales-sounds | 1050 | 2 |
| 3 | Narwhal-whales-sounds-44419-830.txt | Narwhal-whales-sounds | 830 | 1 |
| 4 | Narwhal-whales-sounds-44420-570.txt | Narwhal-whales-sounds | 570 | 0 |
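The `abs(df.index % 5 - 4)` step in `create_df` only gives the right level when every article has exactly five versions. A possible alternative, shown here purely as a sketch and not as the method used in this notebook, is to rank the scores within each article. The toy names and the scores for article "b" are made up for illustration.

```python
import pandas as pd

# Toy data: article "a" has 5 versions, article "b" has only 3 (hypothetical values).
df = pd.DataFrame({
    "name": ["a", "a", "a", "a", "a", "b", "b", "b"],
    "score": [1400, 1220, 1050, 830, 570, 1200, 900, 600],
})

# Rank the scores within each article: the lowest score gets level 0,
# the next one level 1, and so on up to the highest score.
df["level"] = df.groupby("name")["score"].rank(method="first").astype(int) - 1

print(df)
```

For article "a" this reproduces the levels 4, 3, 2, 1, 0 shown in the output above. For an article with fewer versions, the highest score gets a level below 4, so whether this fits depends on how the later notebooks interpret 'level'.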


In [4]:
create_df('britannica', newsela=False).head()

100%|██████████| 177/177 [00:00<00:00, 89466.35it/s]

Sample row: ('woodpecker_kids', 'woodpecker', 'kids')

| | path | name | score | level |
|---|---|---|---|---|
| 0 | african_penguin_students | african-penguin | students | 1 |
| 1 | african_penguin_scholars | african-penguin | scholars | 0 |
| 2 | african_penguin_kids | african-penguin | kids | 2 |
| 3 | albatross_students | albatross | students | 1 |
| 4 | albatross_scholars | albatross | scholars | 0 |
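For the Britannica data, the level comes from a fixed dictionary of audience labels. The sketch below looks the labels up with `Series.map` and lists any label the dictionary does not cover; the label 'teens' is invented here only to show what an unmapped value looks like.

```python
import pandas as pd

# Mapping used in create_df: 0 is the most advanced text, 2 the simplest.
audience_to_level = {"kids": 2, "students": 1, "scholars": 0}

# 'teens' is a made-up label that the mapping does not know.
scores = pd.Series(["students", "scholars", "kids", "teens"])

levels = scores.map(audience_to_level)     # unmapped labels become NaN
unmapped = scores[levels.isna()].tolist()  # labels that need attention

print(unmapped)
```

Checking for unmapped labels before saving the CSV would catch a misnamed file early instead of leaving a missing level for the later notebooks.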
