# Python for Linguists

Notebook 6: CSV format

Venelin Kovatchev

University of Barcelona 2020

#### CSV format

CSV ("comma-separated values") is a plain-text file format that is often used in programming. The fields of each record are separated by commas.

To process CSV files, you first need to import the csv library, which is part of the Python standard library.

In [1]:
import csv

In [2]:
# The following code reads the csv file as a normal file
file_name = "movie_plots.csv"
num_lines = 0
raw_corpus = []

with open(file_name,"r",encoding="utf-8") as inp:
    for file_line in inp:
        raw_corpus.append(file_line)
        num_lines +=1
        if num_lines >= 5:
            break
            
# As you can see, it works; however, the result is hard to read
# The last two prints show that reading the csv as a text file can split a single entry over two lines
print(raw_corpus)
print("\n")
print(raw_corpus[3])
print(raw_corpus[4])


['Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot\n', '1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man\'s bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation\'s face before a group of policemen appear and order everybody to leave.[1]"\n', '1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Light_of_the_Moon,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon\'s sm
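The difficulty described in the comments above can be reproduced without the movie file. The following is a minimal sketch on a made-up text (the name `sample_text` and its content are assumptions for illustration, not data from `movie_plots.csv`). It suggests that neither "one line is one entry" nor "split on every comma" holds once a field is wrapped in quotes.

```python
import io

# Hypothetical CSV text: a header and ONE entry.
# The quoted Plot field contains both a comma and a line break.
sample_text = 'Title,Plot\nSample Film,"A man walks in, sits down.\nHe leaves."\n'

# io.StringIO lets us treat a string as if it were an open text file.
# Reading it as a normal text file gives one item per physical line.
text_lines = list(io.StringIO(sample_text))
print(len(text_lines))

# The single entry is spread over two of those lines
print(text_lines[1])
print(text_lines[2])

# Splitting on commas by hand also cuts the plot at its own comma
naive_fields = text_lines[1].split(",")
print(naive_fields)
```

Here the header plus one entry produce three text lines, and the naive split returns three pieces for what should be two fields.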

In [3]:
# The following code reads the csv file as a list using the csv library
file_name = "movie_plots.csv"
num_lines = 0
list_corpus = []

with open(file_name,"r",encoding="utf-8") as inp:
    reader=csv.reader(inp)
    for file_row in reader:
        list_corpus.append(file_row)
        num_lines +=1
        if num_lines >= 5:
            break
            
# This is better than before - it's easier to read and doesn't split entries
print(list_corpus)
print("\n")
print(list_corpus[3])
print(list_corpus[4])

[['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page', 'Plot'], ['1901', 'Kansas Saloon Smashers', 'American', 'Unknown', '', 'unknown', 'https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers', "A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"], ['1901', 'Love by the Light of the Moon', 'American', 'Unknown', '', 'unknown', 'https://en.wikipedia.org/wiki/Love_by_the_Light_of_the_Moon', "The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and
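A small variation on the cell above, offered as a sketch rather than as the notebook's method: `next()` takes the header row off the reader before the loop, and `itertools.islice` limits the number of rows instead of a manual counter. The in-memory `sample_text` is made up so that the example runs without `movie_plots.csv`; with the real file you would pass the opened file object to `csv.reader` instead.

```python
import csv
import io
from itertools import islice

# Hypothetical CSV text; the first entry has a comma and a line break
# inside its quoted Plot field.
sample_text = (
    'Title,Plot\n'
    'Film A,"A man walks in, sits down.\nHe leaves."\n'
    'Film B,Short plot\n'
    'Film C,Another plot\n'
)

reader = csv.reader(io.StringIO(sample_text))

# The first row holds the field names
header = next(reader)

# Take only the next two entries from the reader
first_rows = list(islice(reader, 2))

print(header)
print(first_rows)
```

The quoted field comes back as a single string, line break included, so each item in `first_rows` corresponds to exactly one entry. When opening a real file for the csv library, the official documentation recommends passing `newline=""` to `open()` so that line breaks inside quoted fields are handled consistently.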

In [4]:
# The following code reads each row of the csv file into a dictionary
file_name = "movie_plots.csv"
num_lines = 0
dict_corpus = []

with open(file_name,"r",encoding="utf-8") as inp:
    dict_reader=csv.DictReader(inp)
    for file_row in dict_reader:
        dict_corpus.append(file_row)
        num_lines +=1
        if num_lines >= 5:
            break
            
# This is better than before: now we know what each field stands for
print(dict_corpus)
print("\n")
print(dict_corpus[3])
print(dict_corpus[4])


# We can access specific "data" for each movie
print("\n The Title of the third entry is:")
print(dict_corpus[2]["Title"])

[OrderedDict([('Release Year', '1901'), ('Title', 'Kansas Saloon Smashers'), ('Origin/Ethnicity', 'American'), ('Director', 'Unknown'), ('Cast', ''), ('Genre', 'unknown'), ('Wiki Page', 'https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers'), ('Plot', "A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]")]), OrderedDict([('Release Year', '1901'), ('Title', 'Love by the Light of the Moon'), ('Origin/Ethnicity', 'American'), ('Director', 'Unknown'), ('Cast', ''), ('Genre', 'unknown'), ('Wiki Page', 'https://en.wikipedia.org/wiki/Love_by_the_Light_
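`csv.DictReader` takes its keys from the first row of the file and exposes them through its `fieldnames` attribute. The output above shows `OrderedDict` objects; from Python 3.8 onward the rows are plain `dict` objects, and they are used in the same way. The sketch below, on made-up data (the three sample films are assumptions, not rows from `movie_plots.csv`), shows one way to work with named fields: counting the values of one column and converting another to numbers.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV text with three of the columns used in this notebook
sample_text = (
    'Release Year,Title,Genre\n'
    '1901,Film A,unknown\n'
    '1902,Film B,comedy\n'
    '1903,Film C,unknown\n'
)

dict_reader = csv.DictReader(io.StringIO(sample_text))

# The keys come from the header row
print(dict_reader.fieldnames)

rows = list(dict_reader)

# Count how often each genre occurs
genre_counts = Counter(row["Genre"] for row in rows)
print(genre_counts)

# Every value is read as a string, so numbers must be converted explicitly
years = [int(row["Release Year"]) for row in rows]
print(years)
```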

In [5]:
# Task 1: manually creating a dictionary
# 
# This task will practice your general programming skills and your understanding of the csv format
# The objective is to manually create the "dictionary" from the csv using the normal "list" reader
#
# The csv.reader reads the file line by line and returns each line as a list
# Your task is to convert each of these lists into a dictionary
# First you need to go through the first line in the file and get the "names" of the different fields
# Then, for each following line you need to read the "values" and assign them to the corresponding dictionary fields



# Solution

# We start with example 3, reading the csv into a list

file_name = "movie_plots.csv"
num_lines = 0
list_corpus = []

with open(file_name,"r",encoding="utf-8") as inp:
    reader=csv.reader(inp)
    for file_row in reader:
        list_corpus.append(file_row)
        num_lines +=1
        if num_lines >= 5:
            break
            
# We initialize a new counter that we will use when going through the "list"
cur_position = 0

# We initialize a list where we want to put the result
dict_corpus_2 = []

# We go through the list line by line:
for line_list in list_corpus:
    # We check if this is the first line
    if cur_position == 0:
        # If it is, then we take the "keys" from this line
        list_keys = line_list
    else:
        # If it is NOT, then we need to create a dictionary
        # We create an empty dictionary every time
        cur_dict = {}
        
        # We use a "trick" to get the position (index) of each element in the list
        # len(line_list) gives us the number of elements in the list (8)
        # range(len(line_list)) gives us the numbers from 0 up to, but not including, that number: 0,1,2,3,4,5,6,7
        for cur_element in range(len(line_list)):
            # We combine the "position" from the key list with the same position in the value list
            cur_dict[list_keys[cur_element]]=line_list[cur_element]
        
        # We add the dictionary to the list of the results
        dict_corpus_2.append(cur_dict)
    # Increase the counter
    cur_position +=1
        
# Compare the csv.DictReader result from In [4] with our manual result
print(dict_corpus[3])
print(dict_corpus_2[3])


OrderedDict([('Release Year', '1901'), ('Title', 'Terrible Teddy, the Grizzly King'), ('Origin/Ethnicity', 'American'), ('Director', 'Unknown'), ('Cast', ''), ('Genre', 'unknown'), ('Wiki Page', 'https://en.wikipedia.org/wiki/Terrible_Teddy,_the_Grizzly_King'), ('Plot', 'Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading "His Photographer" and "His Press Agent" respectively, follow him into the shot; the photographer sets up his camera. "Teddy" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. "Teddy" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. "Teddy" rides th
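The manual counter and the `range(len(...))` trick in the solution above can also be written with the built-in `enumerate()`, which yields each position together with its element. This is a sketch of the same idea, not a replacement for the solution; the small `list_corpus` below is made up so that the example runs on its own.

```python
# Hypothetical stand-in for the list produced by csv.reader:
# the first inner list holds the field names, the rest hold values
list_corpus = [
    ["Title", "Genre"],
    ["Film A", "unknown"],
    ["Film B", "comedy"],
]

dict_corpus_2 = []
list_keys = []

# enumerate() gives us the position and the line at the same time
for cur_position, line_list in enumerate(list_corpus):
    if cur_position == 0:
        # The first line holds the keys
        list_keys = line_list
        continue

    cur_dict = {}
    # Again, enumerate() replaces range(len(line_list))
    for cur_element, value in enumerate(line_list):
        cur_dict[list_keys[cur_element]] = value

    dict_corpus_2.append(cur_dict)

print(dict_corpus_2)
```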

In [6]:
# A shorter solution: zip() pairs each key from the header row
# with the value at the same position in the current line
final_dict = []

for curr_line in list_corpus[1:]:
    output = dict(zip(list_corpus[0], curr_line))
    final_dict.append(output)
final_dict

[{'Release Year': '1901',
  'Title': 'Kansas Saloon Smashers',
  'Origin/Ethnicity': 'American',
  'Director': 'Unknown',
  'Cast': '',
  'Genre': 'unknown',
  'Wiki Page': 'https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers',
  'Plot': "A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"},
 {'Release Year': '1901',
  'Title': 'Love by the Light of the Moon',
  'Origin/Ethnicity': 'American',
  'Director': 'Unknown',
  'Cast': '',
  'Genre': 'unknown',
  'Wiki Page': 'https://en.wikipedia.org/wiki/Love_by_the_Light_of_the_Moon',
  'Plot': "Th
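One detail of the zip-based solution is worth knowing: `zip()` stops at the shorter of its two inputs, so a line with too few or too many values is handled silently. `csv.DictReader` offers the `restval` and `restkey` parameters for these cases. The sketch below uses made-up data with one short and one long line (the sample films and the key name "Other" are assumptions for illustration).

```python
import csv
import io

# Hypothetical CSV text: "Film B" lacks a Genre, "Film C" has an extra value
sample_text = (
    'Title,Genre\n'
    'Film A,unknown\n'
    'Film B\n'
    'Film C,comedy,extra\n'
)

# Manual approach with zip(): missing keys and extra values go unnoticed
rows = list(csv.reader(io.StringIO(sample_text)))
zipped = [dict(zip(rows[0], row)) for row in rows[1:]]
print(zipped)

# DictReader: restval fills in missing values,
# restkey collects any extra values in a list under the given key
dict_rows = list(
    csv.DictReader(io.StringIO(sample_text), restkey="Other", restval="")
)
print(dict_rows)
```

With `zip()`, the dictionary for "Film B" has no "Genre" key at all and the extra value of "Film C" is dropped, whereas `DictReader` keeps both cases visible.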