# COSC 311: Introduction to Data Visualization and Interpretation

Instructor: Dr. Shuangquan (Peter) Wang

Email: spwang@salisbury.edu

Department of Computer Science, Salisbury University


# Module 2_Data Processing and Organization

## 1. File processing



**Contents of this note refer to 1) the teaching materials of the Department of Computer Science, William & Mary; 2) the textbook "Python Crash Course: A Hands-On, Project-Based Introduction to Programming"; 3) the Python tutorial: https://docs.python.org/3/tutorial/**

**<font color=red>All rights reserved. Dissemination or sale of any part of this note is NOT permitted.</font>**

# File

What is a file?

A file is a collection of data. It is static and stored somewhere on your computer or in the cloud.

## How to open a file

Syntax:

**file_object = open(file_name,mode to open the file)**

Opens the file and returns a corresponding file object.

Mode to open the file: (refer to https://stackabuse.com/file-handling-in-python/)
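- r: Opens an existing file for **reading only** (*the default*)
- r+: Opens an existing file for **both reading and writing**
- w: Opens a file for **writing only**; the file is created if it does not exist and emptied if it does
- w+: Opens a file for **writing and reading**; like w, it creates or empties the file
- a: Opens a file for **appending**; new content is added at the end, and the file is created if it does not exist
- a+: Opens a file for both **appending and reading**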
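The difference between w, a and r is easiest to see by running them in sequence. The following is a minimal sketch, not part of the original examples: it assumes a throwaway folder created with `tempfile.mkdtemp()` and a hypothetical file name `demo.txt`, so no real data file is touched.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()               # throwaway folder (assumption for this demo)
path = os.path.join(folder, 'demo.txt')   # hypothetical file name

f = open(path, 'w')      # 'w' creates the file
f.write('first\n')
f.close()

f = open(path, 'w')      # opening with 'w' again empties the file, so 'first' is lost
f.write('second\n')
f.close()

f = open(path, 'a')      # 'a' keeps the existing content and writes at the end
f.write('third\n')
f.close()

f = open(path, 'r')      # 'r' only reads
content = f.read()
f.close()

shutil.rmtree(folder)    # remove the throwaway folder
print(content)
```

Only `second` and `third` are printed, because the second `open(path, 'w')` discarded what was written before it.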

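Python resolves a bare file name relative to the current working directory. When you run a script or a notebook from its own folder, this is usually the folder that contains the program, so the data file and the program file can simply sit in the same folder.

If the data file and the program file are NOT in the same folder, the file path is needed. An example on Windows: file_name = r'C:\python_files\filename.txt'. The r prefix (raw string) stops Python from interpreting backslash sequences such as \f as escape characters; forward slashes ('C:/python_files/filename.txt') also work.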
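Backslashes in Windows paths deserve care, because Python treats sequences such as `\f` and `\n` inside ordinary string literals as escape characters. A small sketch with a made-up path (`C:\files\new.txt` is purely illustrative and is never opened):

```python
plain = 'C:\files\new.txt'       # '\f' and '\n' are read as escape characters
raw = r'C:\files\new.txt'        # raw string: backslashes are kept as typed
escaped = 'C:\\files\\new.txt'   # doubled backslashes give the same result
forward = 'C:/files/new.txt'     # forward slashes are accepted by open() on Windows too

print(len(plain), len(raw))      # the plain string is two characters shorter
print(raw == escaped)
```

The plain string silently contains a form feed and a newline instead of the intended folder separators, which is why the raw-string or forward-slash form is the safer habit.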

## How to read the entire file

Syntax:

**file_object.read()**

Reads the entire contents of the file and returns them as a single string.

## How to read a line from the file

Syntax:

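**file_object.readline()**

Reads the next line of the file and returns it as a string, including the trailing newline character '\n'. It returns an empty string '' when the end of the file has been reached.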

## How to close a file

Syntax:

**file_object.close()**

You should call **file_object.close()** to close the file and immediately free up any system resources used by it. 

If you don’t explicitly close a file, Python’s garbage collector will eventually destroy the object and close the open file for you, but the file may stay open for a while.

There is no guarantee that your changes have been saved to the file until it is closed. Thus, it is bad practice to rely on the garbage collector to close your file for you.
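Every file object has a `closed` attribute that reports whether `close()` has been called. The sketch below, which assumes a throwaway folder and a hypothetical file name `scores_demo.txt`, checks the attribute before and after closing and then confirms that the written text is on disk once the file is closed.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                      # throwaway folder (assumption for this demo)
path = os.path.join(folder, 'scores_demo.txt')   # hypothetical file name

my_file = open(path, 'w')
my_file.write('85\n')
state_before = my_file.closed    # False: the file is still open
my_file.close()
state_after = my_file.closed     # True: resources are released, data is saved

check = open(path, 'r')          # reopen to confirm what was saved
saved = check.read()
check.close()

shutil.rmtree(folder)            # remove the throwaway folder
print(state_before, state_after, repr(saved))
```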

In [None]:
# example: read the entire file
my_file = open('scores_1.txt','r')
content = my_file.read()
print(content)
my_file.close()

In [None]:
# example: read a line
my_file = open('scores_1.txt','r')
my_file.readline()

In [None]:
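# pay attention to the output format: if the returned line ends with '\n',
# print() adds its own newline, so an extra blank line appears in the output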
print(my_file.readline())

In [None]:
my_file.readline()

In [None]:
my_file.readline()

In [None]:
my_file.readline()

In [None]:
my_file.readline()

In [None]:
my_file.readline()

In [None]:
my_file.readline()

In [None]:
my_file.readline()

In [None]:
my_file.readline()

In [None]:
my_file.readline()

In [None]:
my_file.close()

**Example:**

Calculate the mean of the numbers in the file above.

In [None]:
# Calculating the mean value of numbers in file scores_1.txt
my_file = open('scores_1.txt','r')
all_scores = 0
counter = 0

new_score = my_file.readline()
while new_score != '':
    all_scores += int(new_score)
    counter += 1
    new_score = my_file.readline()
print(all_scores/counter)
my_file.close()

**What if the file contains some empty lines? For example: file scores_2.txt**

In [None]:
# Calculating the mean value of numbers in file scores_2.txt
my_file = open('scores_2.txt','r')
all_scores = 0
counter = 0

new_score = my_file.readline()
while new_score != '':
    if new_score != '\n':
        all_scores += int(new_score)
        counter += 1
    new_score = my_file.readline()
print(all_scores/counter)
my_file.close()
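The check `new_score != '\n'` skips lines that are completely empty, but a line containing only spaces would still reach `int()` and raise a `ValueError`. A slightly more tolerant variant, sketched here as an alternative rather than as the course's method, iterates over the file with a `for` loop and uses `strip()` to drop surrounding whitespace. The sample file and its contents are made up for the demonstration.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                      # throwaway folder (assumption for this demo)
path = os.path.join(folder, 'scores_demo.txt')   # hypothetical file name

sample = open(path, 'w')
sample.write('90\n\n80\n   \n70\n')   # made-up scores with an empty and a blank line
sample.close()

all_scores = 0
counter = 0
my_file = open(path, 'r')
for line in my_file:          # a file object yields one line per iteration
    text = line.strip()       # remove '\n' and any surrounding spaces
    if text:                  # an empty string counts as False
        all_scores += int(text)
        counter += 1
my_file.close()

mean = all_scores / counter
shutil.rmtree(folder)         # remove the throwaway folder
print(mean)                   # 80.0
```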

**What if there are multiple items on one line (e.g. name & score)? For example: scores_3.txt**

- The **split(separator)** method breaks a string at each occurrence of the specified separator and returns the pieces as a list of strings (https://www.geeksforgeeks.org/python-string-split/)

- If the separator is not specified or is None, runs of consecutive whitespace are treated as a single separator (https://docs.python.org/2/library/stdtypes.html#str.split)

- Whitespace includes spaces, newlines '\n' and tabs '\t', and consecutive whitespace characters are processed together (https://note.nkmk.me/en/python-split-rsplit-splitlines-re/)

In [None]:
# Example
a = 'Linda 85\n'
a.split()
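To see why the default separator is usually the right choice for whitespace-separated data, compare it with an explicit single-space separator. The strings below are made-up examples.

```python
line = 'Linda   85\n'                # three spaces between the items, newline at the end

default_parts = line.split()         # whitespace runs collapse; the '\n' is dropped
space_parts = line.split(' ')        # every single space is a separator
comma_parts = 'Linda,85,A'.split(',')

print(default_parts)   # ['Linda', '85']
print(space_parts)     # ['Linda', '', '', '85\n']
print(comma_parts)     # ['Linda', '85', 'A']
```

With an explicit `' '` separator, the extra spaces produce empty strings and the newline stays attached to the last item, so `int(space_parts[1])` would fail.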

In [None]:
# Calculating the mean value of numbers in file scores_3.txt
my_file = open('scores_3.txt','r')
all_scores = 0
counter = 0

new_score = my_file.readline()
while new_score != '':
    name_score_list = new_score.split()
    all_scores += int(name_score_list[1])
    counter += 1
    new_score = my_file.readline()
print(all_scores/counter)
my_file.close()


## Write to a file

Syntax:

**file_object.write(content)**

Write content to the file. 

Note:

- The file must be opened in a mode that allows writing (r+, w, w+, a, a+)

- Python can only write strings to a text file. To store numerical data in a text file, first convert it to a string with the str() function.
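The string-only rule can be checked directly. In the sketch below (throwaway folder and hypothetical file name `numbers_demo.txt`), passing an `int` to `write()` raises a `TypeError`, while the converted string is accepted; `write()` returns the number of characters it wrote.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                       # throwaway folder (assumption for this demo)
path = os.path.join(folder, 'numbers_demo.txt')   # hypothetical file name

number = 42
new_file = open(path, 'w')
try:
    new_file.write(number)          # an int is not accepted by write()
    got_type_error = False
except TypeError:
    got_type_error = True
written = new_file.write(str(number) + '\n')   # number of characters written
new_file.close()

check = open(path, 'r')
value = int(check.read())           # convert the stored text back to a number
check.close()

shutil.rmtree(folder)               # remove the throwaway folder
print(got_type_error, written, value)   # True 3 42
```

When reading the data back, the conversion goes the other way: `int()` (or `float()`) turns the text into a number again.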

**Example:**

Write 1,000 random integers between 0 and 100 (inclusive) to a file named numbers.txt, one number per line.

In [None]:
from random import randint

new_file = open('numbers.txt','w')
for i in range(1000):
    number = randint(0,100)
    new_file.write(str(number) + '\n')
new_file.close()

### Another way to open a file

Syntax:

**with open(file_name,mode to open the file) as file_object:**

It is good practice to use the **with** keyword when dealing with file objects. The advantage is that the file is closed properly as soon as the indented block finishes, even if an exception is raised inside it, so there is no need to call *file_object.close()*.
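The "even if an exception is raised" part can be verified with a small experiment. This sketch raises an error on purpose inside the block; the folder, the file name `with_demo.txt` and the simulated error are all assumptions made for the demonstration.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                    # throwaway folder (assumption for this demo)
path = os.path.join(folder, 'with_demo.txt')   # hypothetical file name

try:
    with open(path, 'w') as new_file:
        new_file.write('12\n')
        raise ValueError('simulated problem')   # deliberately interrupt the block
except ValueError:
    print('an error occurred inside the with block')

closed_after = new_file.closed     # True: with closed the file despite the error

with open(path, 'r') as check:
    saved = check.read()           # the text written before the error was saved

shutil.rmtree(folder)              # remove the throwaway folder
print(closed_after, repr(saved))
```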

In [None]:
# the example above, rewritten with the with keyword
from random import randint

with open('numbers_2.txt','w') as new_file:
    for i in range(1000):
        number = randint(0,100)
        new_file.write(str(number) + '\n')


### Use print to write to a file

Syntax: 

**print(content,file = file_object)**

There is no need to convert the content to a string: print() converts each argument for you and, by default, ends the output with a newline.
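`print()` keeps its usual `sep` and `end` parameters when writing to a file, which makes it convenient for lines with several items. A minimal sketch with made-up names and scores, written to a hypothetical file `print_demo.txt` in a throwaway folder:

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                     # throwaway folder (assumption for this demo)
path = os.path.join(folder, 'print_demo.txt')   # hypothetical file name

with open(path, 'w') as new_file:
    print('Linda', 85, file=new_file)           # items joined by a space, newline added
    print('Tom', 92, sep=',', file=new_file)    # items joined by a comma instead
    print(3.5, end='', file=new_file)           # end='' suppresses the final newline

with open(path, 'r') as check:
    content = check.read()

shutil.rmtree(folder)                           # remove the throwaway folder
print(repr(content))    # 'Linda 85\nTom,92\n3.5'
```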

In [None]:
# the example above, rewritten to use print for writing to the file
from random import randint

with open('numbers_3.txt','w') as new_file:
    for i in range(1000):
        number = randint(0,100)
        print(number,file = new_file)

## Exceptions for file

In [None]:
# opening a file that does not exist raises FileNotFoundError
f = open('goaway.txt','r')

In [None]:
# Use try & except to handle the error (a bare except catches every type of exception)
try:
    f = open('goaway.txt','r')
except:
    print('file open error')

In [None]:
# more specifically, we specify the error type
try:
    f = open('goaway.txt','r')
except FileNotFoundError:
    print('file does not exist.')

In [None]:
# multiple exceptions
try:
    #f = open('goaway.txt','r')
    my_list = [1,2,3,4]
    print(my_list[5])
except FileNotFoundError:
    print('file does not exist.')
except IndexError:
    print('Index out of range of your list')
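These pieces can be combined into a small reusable function. The helper below, `read_or_default`, is hypothetical and not part of the course materials; it is a sketch of one way to return a fallback value when a file is missing. It relies on the `filename` attribute that `FileNotFoundError` objects carry, and on the optional `else` clause of `try`, which runs only when no exception occurred.

```python
import os
import tempfile

def read_or_default(file_name, default=''):
    """Hypothetical helper: return the file's text, or `default` if the file is missing."""
    try:
        with open(file_name, 'r') as f:
            content = f.read()
    except FileNotFoundError as error:
        print('file does not exist:', error.filename)
        return default
    else:
        return content              # reached only if open() and read() succeeded

folder = tempfile.mkdtemp()                   # empty throwaway folder (assumption)
missing = os.path.join(folder, 'goaway.txt')  # guaranteed not to exist in the new folder
result = read_or_default(missing, default='no data')
os.rmdir(folder)                              # remove the throwaway folder
print(result)                                 # no data
```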


## Application example

Open and read the file named 'Bovary_Excerpt.txt', then create a dictionary that uses each word in the file as a key and the number of times that word appears as the value.

In [None]:
word_counts = {}
with open('Bovary_Excerpt.txt','r') as file:
    for line in file:
        # upper-case the line and remove punctuation before splitting it into words
        tokens = line.upper().replace(',','').replace(';','').replace('(','').replace(')','')\
        .replace('!','').replace('?','').replace('.','').split()
        for word in tokens:
            try:
                word_counts[word] += 1    # the word has been seen before
            except KeyError:
                word_counts[word] = 1     # first occurrence of the word

In [None]:
print(word_counts['AVEC'])

In [None]:
print(word_counts['ELLE'])
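The same counting can be written without try & except. The sketch below shows two alternatives, `dict.get` with a default value and `collections.Counter` from the standard library, and swaps the chain of `replace` calls for `str.translate`, which removes all ASCII punctuation in one step. It runs on a made-up sentence rather than on Bovary_Excerpt.txt.

```python
import string
from collections import Counter

sample = 'Elle chante, elle danse; elle rit!'    # made-up text, not taken from the excerpt

# Build a translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans('', '', string.punctuation)
tokens = sample.upper().translate(remove_punctuation).split()

# Alternative 1: dict.get returns 0 when the word is not yet a key.
word_counts = {}
for word in tokens:
    word_counts[word] = word_counts.get(word, 0) + 1

# Alternative 2: Counter does the counting in one call.
counter = Counter(tokens)

print(word_counts)                    # {'ELLE': 3, 'CHANTE': 1, 'DANSE': 1, 'RIT': 1}
print(counter['ELLE'], counter['AVEC'])   # 3 0  (a missing word counts as 0)
```

Note that `string.punctuation` covers ASCII punctuation only; typographic characters such as French guillemets would need to be added to the table separately.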

Based on the above dictionary, what if we reverse the keys and values? That is, we use the number of appearances as the key and the list of words that appear that many times as the value.

In [None]:
word_lists = {}
for word,count in word_counts.items():
    try:
        word_lists[count].append(word)    # this count already has a list of words
    except KeyError:
        word_lists[count] = [word]        # first word with this count
print(word_lists[2])
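Grouping words by count is a common pattern, and `collections.defaultdict` handles the "first word with this count" case automatically by creating an empty list for any new key. A sketch using made-up counts (the dictionary below is illustrative, not computed from the excerpt):

```python
from collections import defaultdict

word_counts = {'ELLE': 3, 'AVEC': 2, 'UNE': 2, 'MAISON': 1}   # made-up counts

word_lists = defaultdict(list)          # missing keys start as an empty list
for word, count in word_counts.items():
    word_lists[count].append(word)

print(word_lists[2])                    # ['AVEC', 'UNE']
print(sorted(word_lists.keys()))        # [1, 2, 3]
```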

In [None]:
appearances = list(word_lists.keys()) 
#for value in word_lists.values():
#    print(len(value))
num_words = [len(value) for value in word_lists.values()]
avg_len = [sum([len(word) for word in value]) / len(value) for value in word_lists.values()] # avg. word length for each key 

In [None]:
print(appearances)
print(num_words)
print(avg_len)

In [None]:
import matplotlib.pyplot as plt
plt.bar(appearances, avg_len)
plt.xlabel('number of appearances')
plt.ylabel('average word length')

**matplotlib.pyplot**

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html

matplotlib.pyplot is an interface to matplotlib, which is a comprehensive library for creating static, animated, and interactive visualizations in Python. It provides a MATLAB-like way of plotting.

matplotlib.pyplot is mainly intended for interactive plots and simple cases of programmatic plot generation.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
print(x)
y = np.sin(x)
print(y)
plt.plot(x, y)