# Files
- All files have already been created. You can choose to not run this notebook. If you do run it, the files will just be overwritten. Nothing will change, so go for it!
- I manually edited the largest data file due to some weird comma-parsing errors.

# Final Dataset Format
### The dataset file with all information is named "dataset.csv".

The attributes in the file are comma separated.

The order of the attributes is:
1. Number of nouns
2. Number of foreign words
3. Number of prepositions
4. Number of determiners
5. Number of adjectives
6. Nationality of artist
7. Gender of artist

### The dataset file with gender only is named "gender_dataset.csv".

The attributes in the file are comma separated.

The order of the attributes is:

1. Number of nouns
2. Number of foreign words
3. Number of prepositions
4. Number of determiners
5. Number of adjectives
6. Gender of artist

### The dataset file with nationality only is named "nationality_dataset.csv".

The attributes in the file are comma separated.

The order of the attributes is:

1. Number of nouns
2. Number of foreign words
3. Number of prepositions
4. Number of determiners
5. Number of adjectives
6. Nationality of artist

Note: we only use gender_dataset.csv, because our goal is to predict artist gender from titles. We also create nationality_dataset.csv so that artist nationality can be predicted in the future.
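
As a quick illustration of the layout described above, here is a minimal sketch of reading rows in the gender_dataset.csv format with Python's built-in csv module. The helper name `parse_gender_rows`, the column labels, and the two sample rows are made up for this example; they are not taken from the real file.

```python
import csv
import io

# Labels for the attribute order listed above (the names are my own, not from the file)
COLUMNS = ["nouns", "foreign_words", "prepositions", "determiners", "adjectives", "gender"]

def parse_gender_rows(handle):
    """Yield one dict per row: five integer counts plus the gender label."""
    for row in csv.reader(handle):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        for name in COLUMNS[:-1]:
            record[name] = int(record[name])
        yield record

# Made-up sample rows in the same format as gender_dataset.csv
sample = io.StringIO("3,0,1,0,0,Male\n2,0,0,1,1,Female\n")
rows = list(parse_gender_rows(sample))
print(rows[0])
```

To read the real file instead of the sample, pass `open("gender_dataset.csv", encoding="utf-8", newline="")` to the same function.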

# Downloading the dataset
You don't need to do this step!

First, I created a directory called "dataset."

From inside the dataset directory, I downloaded the MOMA dataset.

These files are too large to be pushed to GitHub. Here are the commands I ran:

mkdir dataset

git clone https://github.com/MuseumofModernArt/collection

# Download NLTK
- Download Python 3.7 (https://www.python.org/downloads/)
- Install numpy by running this command: pip install numpy
- Install NLTK by running this command: pip install nltk

# Parse Dataset
We only want to keep three things from our dataset: the title, the gender, and the nationality. We are using the data.csv file.

The data.csv file is a file I edited by hand in Excel to make parsing easier: I deleted some columns and punctuation.

In the original file, the title is the first element in each line, the nationality is the fifth, and the gender is the eighth. In data.csv only these three columns remain, so the title is the first element, the nationality is the second, and the gender is the third.

The script below opens the dataset and pulls just the data we want from every line. Then it writes that data to a new file.

In [3]:
##########################
#Opening and Making Files
##########################

# Open the file, which we will name 'file'
# If you run this yourself, make sure data.csv is in the same directory or change the file path
# The 'r' parameter is saying that we only need to read this file, not write to it
file = open("data.csv", "r", encoding="utf-8")

# Lines is a Python list (like an array) of the lines in the file
lines = file.readlines()
# We have everything we need in 'lines', so close the input file
file.close()

# We also need to make a new file to write our data to.
# "w" tells Python that we are writing to this file: it is created if it doesn't exist and emptied if it does
# (the '+' additionally allows reading from it)
new_file = open("temp_dataset.txt", "w+", encoding="utf-8")

In [4]:
##########################
# Practice with Lines
##########################

# We can access the elements in the list of lines.
# For example, this code will print out the first line:
print(lines[0])
# The first line tells us what the columns mean!
# That's convenient: looks like we need columns 0, 1, and 2.
# Here is the first line of actual data:
print(lines[1])
# Now look at line 8:
print(lines[8])
# Some of our data is missing nationalities or genders!
# We need to deal with this.
# Since we have a lot of data for a decision tree, I just removed those lines.

Title,Nationality,Gender

Ferdinandsbrücke Project Vienna Austria Elevation preliminary version,Austrian,Male

The Manhattan Transcripts Project New York New York Episode 1: The Park,,Male



In [5]:
################################
# Writing our data to a new file
################################

# This is a for-each loop that goes over each line.
for line in lines:
    # Split each line on commas and strip surrounding whitespace from every element
    # (readlines() keeps the trailing newline, which would otherwise stick to the gender)
    elements = [element.strip() for element in line.split(',')]
    # Was all the information there?
    if len(elements) == 3:
        # A missing value still shows up as an empty string, so check that no element is empty
        if elements[0] != "" and elements[1] != "" and elements[2] != "":
            # Sometimes more than one nationality was listed. Let's take the first one.
            nationality = elements[1].split(" ")[0]
            # Sometimes more than one gender was listed. Let's take the first one.
            gender = elements[2].split(" ")[0]
            # Sometimes random nonsense got into our data. Let's make sure the gender is 'valid' (i.e. in this dataset)
            if gender == "Male" or gender == "Female":
                new_line = elements[0] + "," + nationality + "," + gender + "\n"
                new_file.write(new_line)
# Close our files!
new_file.close()
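
The manual comma splitting above works because data.csv was cleaned by hand first. As an alternative sketch, Python's built-in csv module understands quoted fields, so a title that contains a comma stays in one piece. The function name `clean_rows` and the sample rows are assumptions made up for illustration; they are not part of the notebook or the real data.

```python
import csv
import io

VALID_GENDERS = {"Male", "Female"}

def clean_rows(handle):
    """Yield (title, nationality, gender) for rows where all three fields are present."""
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # wrong number of columns
        title, nationality, gender = (field.strip() for field in row)
        if not (title and nationality and gender):
            continue  # at least one value is missing
        # Keep only the first nationality and gender when several are listed
        nationality = nationality.split(" ")[0]
        gender = gender.split(" ")[0]
        if gender in VALID_GENDERS:
            yield title, nationality, gender

# Made-up sample in the same three-column layout as data.csv
sample = io.StringIO(
    'Title,Nationality,Gender\n'
    '"Chair, Model 3",Austrian,Male\n'
    'Untitled,,Male\n'
    'Two Figures,American American,Female Male\n'
)
cleaned = list(clean_rows(sample))
print(cleaned)
```

If titles with commas are kept this way, the output file should be written with `csv.writer` as well, so that those titles are quoted again on the way out.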

# Using NLTK
Here is the outline of what we will do now:

1. Import the Natural Language Toolkit (NLTK).
2. Go through each line of the new file.
3. Tokenize and tag the title on each line.
4. Use the tagging to count the FOREIGN WORDS (FW), NOUNS (NN or NNS or NNP or NNPS), PREPOSITIONS (IN), ADJECTIVES (JJ, JJR, JJS), and DETERMINERS (DT).
5. Write this information to a new csv file.

### Explanation of Tags:
- FW is any foreign word. This may impact our results: we are using Anglocentric software, so perhaps our model will be excellent at predicting American/British nationalities.
- NN is a singular noun. NNS is a plural noun. NNP is a singular proper noun. NNPS is a plural proper noun.
- IN is a preposition.
- JJ is a regular adjective (big). JJR is a comparative adjective (bigger). JJS is a superlative adjective (biggest).
- DT is a determiner (e.g. "the")
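
To make the tag groups above concrete, here is a small sketch that maps each tag to the feature it feeds and counts matches. It works on hand-written (word, tag) pairs in the list-of-tuples shape that `nltk.pos_tag` returns, so it runs without NLTK. The names `TAG_TO_FEATURE` and `count_features` are hypothetical, and the sketch returns raw counts with no upper limit applied.

```python
from collections import Counter

# Map each Penn Treebank tag we care about to the feature it feeds
TAG_TO_FEATURE = {
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "FW": "foreign",
    "IN": "preposition",
    "DT": "determiner",
    "JJ": "adjective", "JJR": "adjective", "JJS": "adjective",
}

def count_features(tags):
    """Count how many tokens fall into each feature; all other tags are ignored."""
    counts = Counter()
    for _word, tag in tags:
        feature = TAG_TO_FEATURE.get(tag)
        if feature is not None:
            counts[feature] += 1
    return counts

# Hand-written (word, tag) pairs for illustration
example_tags = [("Seagull", "NNP"), ("-", ":"), ("Bikini", "NNP"), ("of", "IN"), ("God", "NNP")]
counts = count_features(example_tags)
print(counts["noun"], counts["preposition"], counts["determiner"])
```

Keeping the tag groups in one dictionary means a tag name only has to be spelled correctly in a single place.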

In [6]:
# Import NLTK
import nltk
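# Download the extra data NLTK needs: the 'punkt' tokenizer models and the POS tagger model
# (newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead)
# I promise I'm not actually punking you
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')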

True

In [7]:
# Open up our temp data file
old_data = open("temp_dataset.txt", "r", encoding="utf-8")
old_lines = old_data.readlines()
old_data.close()
# Make our final dataset file
dataset = open("dataset.csv", "w+", encoding="utf-8")
gender_dataset = open("gender_dataset.csv", "w+", encoding="utf-8")
nat_dataset = open("nationality_dataset.csv", "w+", encoding="utf-8")

In [8]:
####################
# Practice with NLTK
####################

#Grab a random line from our dataset to play with
temp = old_lines[1329]
print(old_lines[1329])
#Split the line on commas
temp_elements = temp.split(",")
#Print the title we are playing with
print(temp_elements[0])
#Tokenize the title
text = nltk.word_tokenize(temp_elements[0])
#Tag the title
tags = nltk.pos_tag(text)

# tags is a list of (word, tag) tuples (basically a 2D array)
# tags[X][0] is the word we tagged, tags[X][1] is the POS tag
# for example, tags[0][1] is the tag for the first word in the title

print(tags)

Seagull - Bikini of God,American,Male

Seagull - Bikini of God
[('Seagull', 'NNP'), ('-', ':'), ('Bikini', 'NNP'), ('of', 'IN'), ('God', 'NNP')]


# Some notes on NLTK tagging
Wow, sometimes the sentence tagging is AWFUL.

Options:
1. Tag everything ourselves
2. Ignore the problem
3. Justify the problem

Option 3: Well, it's bad, but... our research doesn't necessarily rely on the tagging being accurate, just consistent. Ultimately, it doesn't matter if NLTK thinks something is a noun when it isn't, as long as it makes the same mistake consistently. We are looking for patterns. We should acknowledge that this means we can't make claims about noun usage in women's art titles in general, only as seen through the lens of NLTK. Part of our presentation could be about these limitations.

In [9]:
# go through each line
for line in old_lines:
    
    NN = 0 # number of nouns (capped at 20)
    FW = 0 # foreign words (capped at 1, so effectively a yes/no flag)
    IN = 0 # prepositions (capped at 1)
    DET = 0 # determiners (capped at 1)
    ADJ = 0 # adjectives (capped at 1)
    
    # get the elements from each line in our temp_data file
    # elements[0] is the title, elements[1] the nationality, elements[2] the gender plus the trailing newline
    elements = line.split(",")
    
    # tokenize and tag the title
    text = nltk.word_tokenize(elements[0])
    tags = nltk.pos_tag(text)
    
    # go through each (word, tag) pair and count tags
    for tag in tags:
        if tag[1] in ("NN", "NNS", "NNP", "NNPS"):
            if NN < 20:
                NN = NN + 1
        if tag[1] == "FW":
            if FW < 1:
                FW = FW + 1
        if tag[1] == "IN":
            if IN < 1:
                IN = IN + 1
        # the determiner tag is "DT" (not "DET")
        if tag[1] == "DT":
            if DET < 1:
                DET = DET + 1
        if tag[1] in ("JJ", "JJR", "JJS"):
            if ADJ < 1:
                ADJ = ADJ + 1
    # every field needs its own comma, including the nationality and the gender
    write_line = str(NN) + "," + str(FW) + "," + str(IN) + "," + str(DET) + "," + str(ADJ) + "," + elements[1] + "," + elements[2]
    dataset.write(write_line)
    write_line = str(NN) + "," + str(FW) + "," + str(IN) + "," + str(DET) + "," + str(ADJ) + "," + elements[2]
    gender_dataset.write(write_line)
    write_line = str(NN) + "," + str(FW) + "," + str(IN) + "," + str(DET) + "," + str(ADJ) + "," + elements[1] + "\n"
    nat_dataset.write(write_line)

#close data file
dataset.close()
gender_dataset.close()
nat_dataset.close()
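
Building each row by string concatenation makes it easy to lose a comma. As a minimal alternative sketch, the fields can be collected in a list and joined, so the separators are placed automatically. The helper name `build_rows` is hypothetical and the sample values are made up.

```python
def build_rows(counts, nationality, gender):
    """Return the three output lines (full, gender-only, nationality-only) for one title."""
    count_fields = [str(value) for value in counts]
    full = ",".join(count_fields + [nationality, gender]) + "\n"
    gender_only = ",".join(count_fields + [gender]) + "\n"
    nationality_only = ",".join(count_fields + [nationality]) + "\n"
    return full, gender_only, nationality_only

# counts are in the order: nouns, foreign words, prepositions, determiners, adjectives
full, gender_only, nationality_only = build_rows([3, 0, 1, 0, 0], "American", "Male")
print(full, end="")
```

This assumes the gender has already had its trailing newline removed (for example with `elements[2].strip()`), since the helper adds the line ending itself.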