## Portfolio Assignment 1: Basic Scripting with Python

Using the corpus called 100-english-novels found on the cds-language GitHub repo, write a Python programme which does the following:

1. Calculate the total word count for each novel
2. Calculate the total number of unique words for each novel
3. Save the result as a single file consisting of three columns: filename, total_words, unique_words

__TASK 1: CALCULATE THE TOTAL WORD COUNT FOR EACH NOVEL__

In [1]:
# First I want to import the novels from the 100-english-novels corpus and for this I need the Path module

# Importing Path module and the os module
from pathlib import Path
import os

# Specifying the data path
data_path = os.path.join("..", "data", "100_english_novels", "corpus")

# Importing all files (all novels) ending with ".txt" using the glob() function. I then split each novel into tokens (words) using the split() function and then I count the number of tokens/words for each novel using the len() function.
for filename in Path(data_path).glob("*.txt"):
    with open(filename, "r", encoding="utf-8") as file:
        novel = file.read()
        split_novel = novel.split() # splitting the novel into tokens/words
        print(f"{filename} has a word count of {len(split_novel)}") # counting the number of words in each novel
        

../data/100_english_novels/corpus/Cbronte_Villette_1853.txt has a word count of 196557
../data/100_english_novels/corpus/Forster_Angels_1905.txt has a word count of 50477
../data/100_english_novels/corpus/Woolf_Lighthouse_1927.txt has a word count of 70185
../data/100_english_novels/corpus/Meredith_Richmond_1871.txt has a word count of 214985
../data/100_english_novels/corpus/Stevenson_Treasure_1883.txt has a word count of 68448
../data/100_english_novels/corpus/Forster_Howards_1910.txt has a word count of 111057
../data/100_english_novels/corpus/Wcollins_Basil_1852.txt has a word count of 118088
../data/100_english_novels/corpus/Schreiner_Undine_1929.txt has a word count of 90672
../data/100_english_novels/corpus/Galsworthy_Man_1906.txt has a word count of 110455
../data/100_english_novels/corpus/Corelli_Innocent_1914.txt has a word count of 121950
../data/100_english_novels/corpus/Kipling_Light_1891.txt has a word count of 72479
../data/100_english_novels/corpus/Conrad_Nostromo_1904.
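
A small sketch of what the word count above actually measures, using a made-up sentence rather than a novel from the corpus. Called without arguments, split() cuts the text at any run of whitespace, so punctuation stays attached to the neighbouring word and each chunk counts as one token. The variable names and the sample text are illustrative assumptions.

```python
# Hypothetical sample text, not taken from the corpus
sample = "It was a dark night.\nA dark, dark night!"

# split() with no arguments splits on any run of whitespace (spaces, newlines, tabs)
tokens = sample.split()

print(tokens)       # punctuation stays attached, e.g. "night." and "dark,"
print(len(tokens))  # number of whitespace-separated tokens
```

The counts reported for the novels are therefore counts of whitespace-separated tokens, which is a reasonable first approximation of a word count.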

__TASK 2: CALCULATE THE TOTAL NUMBER OF UNIQUE WORDS FOR EACH NOVEL__

In [2]:
# For this task I am going to use the set() function, which removes duplicate words and keeps one copy of each distinct word in the novel. I then use the len() function to count these distinct words, i.e. the number of different words that occur at least once.
for filename in Path(data_path).glob("*.txt"):
    with open(filename, "r", encoding="utf-8") as file:
        novel = file.read()
        split_novel = novel.split() # splitting the novel into words
        unique_words = set(split_novel) # removing duplicate words
        print(f"{filename} contains {len(unique_words)} unique words") # counting the number of unique words for each novel
        

../data/100_english_novels/corpus/Cbronte_Villette_1853.txt contains 29084 unique words
../data/100_english_novels/corpus/Forster_Angels_1905.txt contains 9464 unique words
../data/100_english_novels/corpus/Woolf_Lighthouse_1927.txt contains 11157 unique words
../data/100_english_novels/corpus/Meredith_Richmond_1871.txt contains 28892 unique words
../data/100_english_novels/corpus/Stevenson_Treasure_1883.txt contains 10831 unique words
../data/100_english_novels/corpus/Forster_Howards_1910.txt contains 17065 unique words
../data/100_english_novels/corpus/Wcollins_Basil_1852.txt contains 14586 unique words
../data/100_english_novels/corpus/Schreiner_Undine_1929.txt contains 11744 unique words
../data/100_english_novels/corpus/Galsworthy_Man_1906.txt contains 16713 unique words
../data/100_english_novels/corpus/Corelli_Innocent_1914.txt contains 19627 unique words
../data/100_english_novels/corpus/Kipling_Light_1891.txt contains 12493 unique words
../data/100_english_novels/corpus/Conrad

../data/100_english_novels/corpus/Gaskell_Ruth_1855.txt contains 18148 unique words
../data/100_english_novels/corpus/Kipling_Captains_1896.txt contains 11709 unique words
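
Because set() compares the tokens exactly as split() produced them, "The", "the" and "Cat." all count as different words. The sketch below illustrates this on a made-up sentence and shows one possible normalisation (lowercasing and keeping only letters and apostrophes). The regular expression and the sample text are assumptions for illustration, not part of the assignment.

```python
import re

# Hypothetical sample text, not taken from the corpus
sample = "The cat saw the Cat. The cat ran."

# Same approach as in the task: whitespace tokens, then set()
raw_tokens = sample.split()
raw_unique = set(raw_tokens)

# Assumed normalisation: lowercase everything, keep runs of letters/apostrophes
normalised_tokens = re.findall(r"[a-z']+", sample.lower())
normalised_unique = set(normalised_tokens)

print(len(raw_unique), sorted(raw_unique))
print(len(normalised_unique), sorted(normalised_unique))
```

With normalisation the unique word counts for the novels would most likely be lower than the ones printed above, since capitalised and punctuated variants collapse into one word.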


__TASK 3: SAVE THE RESULT AS A SINGLE FILE WITH COLUMNS FILENAME, TOTAL_WORDS, UNIQUE_WORDS__


In [5]:
# For this task I am going to use the Pandas module to create a dataframe and then save it as a CSV-file using the dataframe's to_csv() method.
import pandas as pd

# Column names for the result file
columns = ['filename', 'total_words', 'unique_words']

# Creating an empty dataframe with Pandas that will be extended in the loop
dataframe = pd.DataFrame(columns = columns)

# Creating a loop that loops through each txt-file and adds a row with the information (filename, total words, unique words) to the dataframe
for filename in Path(data_path).glob("*.txt"):
    with open(filename, "r", encoding = "utf-8") as file:
        novel = file.read()
        split_novel = novel.split()
        unique_words = set(split_novel)
        data = {'filename':  [filename],
                'total_words': [len(split_novel)],
                'unique_words': [len(unique_words)]}
        # DataFrame.append() was removed in pandas 2.0, so pd.concat() is used to add the new row
        dataframe = pd.concat([dataframe, pd.DataFrame(data, columns = columns)])
        print(dataframe) # making sure that the dataframe looks right

# Saving the finished dataframe as a csv-file once, after the loop has finished
dataframe.to_csv(r'../data/100_english_novels/novel_info.csv', index = False)

# Now I have a single CSV-file called "novel_info.csv" that contains all the relevant columns and is located in the specified directory.


                                            filename total_words unique_words
0  ../data/100_english_novels/corpus/Cbronte_Vill...      196557        29084
                                            filename total_words unique_words
0  ../data/100_english_novels/corpus/Cbronte_Vill...      196557        29084
0  ../data/100_english_novels/corpus/Forster_Ange...       50477         9464
                                            filename total_words unique_words
0  ../data/100_english_novels/corpus/Cbronte_Vill...      196557        29084
0  ../data/100_english_novels/corpus/Forster_Ange...       50477         9464
0  ../data/100_english_novels/corpus/Woolf_Lighth...       70185        11157
                                            filename total_words unique_words
0  ../data/100_english_novels/corpus/Cbronte_Vill...      196557        29084
0  ../data/100_english_novels/corpus/Forster_Ange...       50477         9464
0  ../data/100_english_novels/corpus/Woolf_Lighth...       70185

                                            filename total_words unique_words
0  ../data/100_english_novels/corpus/Cbronte_Vill...      196557        29084
0  ../data/100_english_novels/corpus/Forster_Ange...       50477         9464
0  ../data/100_english_novels/corpus/Woolf_Lighth...       70185        11157
0  ../data/100_english_novels/corpus/Meredith_Ric...      214985        28892
0  ../data/100_english_novels/corpus/Stevenson_Tr...       68448        10831
0  ../data/100_english_novels/corpus/Forster_Howa...      111057        17065
0  ../data/100_english_novels/corpus/Wcollins_Bas...      118088        14586
0  ../data/100_english_novels/corpus/Schreiner_Un...       90672        11744
0  ../data/100_english_novels/corpus/Galsworthy_M...      110455        16713
0  ../data/100_english_novels/corpus/Corelli_Inno...      121950        19627
0  ../data/100_english_novels/corpus/Kipling_Ligh...       72479        12493
0  ../data/100_english_novels/corpus/Conrad_Nostr...      172276

                                             filename total_words unique_words
0   ../data/100_english_novels/corpus/Cbronte_Vill...      196557        29084
0   ../data/100_english_novels/corpus/Forster_Ange...       50477         9464
0   ../data/100_english_novels/corpus/Woolf_Lighth...       70185        11157
0   ../data/100_english_novels/corpus/Meredith_Ric...      214985        28892
0   ../data/100_english_novels/corpus/Stevenson_Tr...       68448        10831
..                                                ...         ...          ...
0   ../data/100_english_novels/corpus/Schreiner_Tr...       24612         4832
0   ../data/100_english_novels/corpus/Cbronte_Shir...      218572        29500
0   ../data/100_english_novels/corpus/James_Ambass...      167555        17390
0   ../data/100_english_novels/corpus/Lawrence_Ser...      172356        21246
0   ../data/100_english_novels/corpus/Braddon_Ques...      174199        17608

[61 rows x 3 columns]
                             

                                             filename total_words unique_words
0   ../data/100_english_novels/corpus/Cbronte_Vill...      196557        29084
0   ../data/100_english_novels/corpus/Forster_Ange...       50477         9464
0   ../data/100_english_novels/corpus/Woolf_Lighth...       70185        11157
0   ../data/100_english_novels/corpus/Meredith_Ric...      214985        28892
0   ../data/100_english_novels/corpus/Stevenson_Tr...       68448        10831
..                                                ...         ...          ...
0   ../data/100_english_novels/corpus/Gaskell_Love...      191037        21087
0   ../data/100_english_novels/corpus/Corelli_Roma...      100526        15923
0   ../data/100_english_novels/corpus/Conrad_Rover...       88101        11978
0   ../data/100_english_novels/corpus/Gissing_Wome...      139234        16912
0   ../data/100_english_novels/corpus/Woolf_Years_...      130903        16701

[75 rows x 3 columns]
                             

                                             filename total_words unique_words
0   ../data/100_english_novels/corpus/Cbronte_Vill...      196557        29084
0   ../data/100_english_novels/corpus/Forster_Ange...       50477         9464
0   ../data/100_english_novels/corpus/Woolf_Lighth...       70185        11157
0   ../data/100_english_novels/corpus/Meredith_Ric...      214985        28892
0   ../data/100_english_novels/corpus/Stevenson_Tr...       68448        10831
..                                                ...         ...          ...
0   ../data/100_english_novels/corpus/Ward_Milly_1...       47588         6688
0   ../data/100_english_novels/corpus/Ford_Girl_19...       35708         7092
0   ../data/100_english_novels/corpus/Meredith_Fev...      168781        24576
0   ../data/100_english_novels/corpus/Lee_Albany_1...       62913        11628
0   ../data/100_english_novels/corpus/Ford_Soldier...       76750        10626

[88 rows x 3 columns]
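
As an alternative to growing a dataframe row by row, the three tasks can be combined in one pass that collects the rows in a list and writes the file once. The sketch below uses only the standard library csv module. The function name summarise_corpus, its parameters, and the choice to store only the file name (not the full path) are assumptions made for illustration.

```python
import csv
from pathlib import Path


def summarise_corpus(corpus_dir, out_file):
    """Write one CSV row per .txt file: filename, total_words, unique_words."""
    rows = []
    # sorted() makes the row order reproducible; glob() alone gives no guaranteed order
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        tokens = path.read_text(encoding="utf-8").split()
        rows.append({
            "filename": path.name,              # file name only, without the directory
            "total_words": len(tokens),         # whitespace-separated tokens
            "unique_words": len(set(tokens)),   # distinct tokens
        })

    # newline="" is what the csv module expects when writing files
    with open(out_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["filename", "total_words", "unique_words"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Under these assumptions, calling summarise_corpus(data_path, "novel_info.csv") would read each novel once and produce the same three columns as the dataframe above.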
                             