## Fundamentals of Python - part 4 (Solutions) 

In this lecture you will learn how to use Python for working with data: reading files, writing files, and loading, processing, and saving data with the Pandas library.

## References
Mark Lutz, 'Learning Python: Powerful Object-Oriented Programming', O'Reilly Media, Inc., 2013. 

Dane Hillard, 'Practices of the Python Pro', Manning Publications, 2020.

Python Basics for Data Science (IBM PY0101EN)

w3schools.com/python/pandas

### 1. Reading files with 'open' function

One way to read or write a file in Python is to use the built-in <code>open</code> function. The <code>open</code> function provides a <b>File object</b> that contains the methods and attributes you need in order to read, save, and manipulate the file. In this notebook, we will only cover <b>.txt</b> files. The first parameter you need is the file path and the file name.

In [7]:
with open('data1.txt', 'w') as fp:
    pass

In [8]:
lines = ['Beautiful is better than ugly.',
'Explicit is better than implicit.',
'Simple is better than complex.',
'Complex is better than complicated.']
with open('data1.txt', 'w') as f:
    for line in lines:
        f.write(line)
        f.write('\n')

In [9]:
# open the data1.txt

# filepath = "/resources/data/"
filename1 = 'data1.txt'
file1 = open(filename1, "r")

The mode argument is optional and the default value is <b>r</b>. In this notebook we cover three modes: 
<ul>
    <li><b>r</b> Read mode for reading files </li>
    <li><b>w</b> Write mode for writing files (an existing file is overwritten)</li>
    <li><b>a</b> Append mode for adding text to the end of a file</li>
</ul>

In [10]:
#attributes .name;.mode
print(file1.name)
print(file1.mode)

data1.txt
r


In [11]:
#assign a variable
philosophy_statements = file1.read()
philosophy_statements

'Beautiful is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.\n'

In [12]:
print(philosophy_statements)

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.



In [13]:
type(philosophy_statements)

str

In [14]:
#close file after finishing the work
file1.close()

<b>A better way to open the file</b><p>
Using the <code>with</code> statement is better practice because it automatically closes the file, even if the code encounters an exception. Python runs everything in the indented block and then closes the file object. 

In [15]:
with open(filename1, "r") as file1:
    statements = file1.read()
    print(statements)

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.



The file object is now closed. You can verify this by running the following cell:  

In [None]:
#verify if the file is closed

file1.closed
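To see that the <code>with</code> statement closes the file even when an exception occurs, here is a minimal sketch. The scratch file <code>demo.txt</code> and the simulated <code>ValueError</code> are assumptions made only for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file, so the sketch does not touch data1.txt
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("Beautiful is better than ugly.\n")

try:
    with open(path, "r") as demo_file:
        first_line = demo_file.readline()
        # Simulate a failure in the middle of the block
        raise ValueError("simulated failure inside the block")
except ValueError:
    pass

# The file was closed when the block exited, despite the exception
print(demo_file.closed)
```

With a plain <code>open()</code> call and no <code>with</code>, the same failure would leave the file open until <code>close()</code> is called explicitly.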

We don't have to read the entire file. For example, we can read the first 4 characters by passing four as a parameter to the method **.read()**:

In [None]:
# read the first four characters

with open(filename1, "r") as file1:
    print(file1.read(4))

Once the method <code>.read(4)</code> is called, the first 4 characters are returned. If we call the method again, the next 4 characters are returned, because the file object remembers its current position. The output for the following cell will demonstrate the process for different inputs to the method <code>read()</code>:

In [None]:
#read certain amount of characters

with open(filename1, "r") as file1:
    print(file1.read(10))
    print(file1.read(10))
    print(file1.read(10))
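The position tracking described above can be sketched as follows. The scratch file <code>demo.txt</code> is an assumption for this example; it holds only the first statement from data1.txt.

```python
import os
import tempfile

# Hypothetical scratch file with a single line of text
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("Beautiful is better than ugly.\n")

with open(path, "r") as f:
    chunk1 = f.read(4)   # first four characters
    chunk2 = f.read(4)   # the next four characters
    rest = f.read()      # everything that is left
    at_end = f.read(4)   # nothing left: an empty string

print(repr(chunk1), repr(chunk2), repr(rest), repr(at_end))
```

An empty string returned by <code>read()</code> is the usual signal that the end of the file has been reached.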

We can also read one line of the file at a time using the method <code>readline()</code>: 

In [None]:
with open(filename1, "r") as file1:
    print("first line: " + file1.readline())

We can use a loop to iterate through each line: 

In [None]:
#Exercise: iterate through the lines by using for loop

with open(filename1, "r") as file1:
    # enumerate numbers the lines starting from 1
    for i, line in enumerate(file1, start=1):
        print("line", str(i), ": ", line)

We can use the method <code>readlines()</code> to save the text file to a list: 

In [None]:
with open(filename1, "r") as file1:
    FileasList = file1.readlines()

In [None]:
FileasList
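Each string returned by <code>readlines()</code> keeps its trailing newline character, which is also why printing the lines in a loop shows blank lines in between. A small sketch of removing it, assuming a hypothetical scratch file <code>demo.txt</code>:

```python
import os
import tempfile

# Hypothetical scratch file with two lines
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("Simple is better than complex.\n")
    f.write("Complex is better than complicated.\n")

with open(path, "r") as f:
    raw_lines = f.readlines()            # each item ends with '\n'

with open(path, "r") as f:
    # iterate over the file and strip the newline from each line
    clean_lines = [line.rstrip("\n") for line in f]

print(raw_lines)
print(clean_lines)
```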

### 2. Writing files with 'open' function

We can use the file object's method <code>write()</code> to save text to a file. To write, the mode argument must be set to write <b>w</b>. Let's write a file <b>data2.txt</b> with a line of text.

In [None]:
#write line to the file
filepath = ''
filename = 'data2.txt'
with open(filepath+filename, 'w') as writefile:
    writefile.write("Complex is better than complicated")

In [None]:
#read the file
with open(filepath+filename, 'r') as testwritefile:
    print(testwritefile.read())

In [None]:
#write several lines to file
with open(filepath+filename, 'w') as writefile:
    writefile.write("Complex is better than complicated.\n")
    writefile.write("Readability counts.\n")

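The method <code>.write()</code> works similarly to the method <code>.readline()</code>, except that it writes text instead of reading it. Note that <code>.write()</code> does not add a newline automatically, so each line must end with <code>\n</code>.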

In [None]:
#verify the content written to the file
with open(filepath+filename, 'r') as testwritefile:
    print(testwritefile.read())
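Because <code>write()</code> does not add line endings, several lines can also be written in one call with <code>writelines()</code>, as long as we supply the newlines ourselves. This is a minimal sketch; the scratch file <code>demo.txt</code> and the list <code>statements</code> are assumptions for the example.

```python
import os
import tempfile

# Hypothetical scratch file, so data2.txt is left untouched
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
statements = ["Complex is better than complicated.", "Readability counts."]

with open(path, "w") as writefile:
    # writelines() writes each string as-is, so we append '\n' ourselves
    writefile.writelines(line + "\n" for line in statements)

with open(path, "r") as readfile:
    content = readfile.read()

print(content)
```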

By setting the mode argument to **a** (append), you can add a new line to the end of the file as follows:

In [None]:
#write a new line to text file

with open(filepath+filename, 'a') as testwritefile:
    testwritefile.write("Knights who say Ni to Arthur.\n")

In [None]:
with open(filepath+filename, 'r') as testwritefile:
    print(testwritefile.read())
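The difference between the modes <b>w</b> and <b>a</b> can be sketched as follows: append keeps the existing content, while write replaces it. The scratch file <code>demo.txt</code> is an assumption for this example.

```python
import os
import tempfile

# Hypothetical scratch file
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("first\n")

# 'a' adds to the end of the existing content
with open(path, "a") as f:
    f.write("second\n")
with open(path, "r") as f:
    after_append = f.read()

# 'w' truncates the file before writing
with open(path, "w") as f:
    f.write("third\n")
with open(path, "r") as f:
    after_write = f.read()

print(repr(after_append))
print(repr(after_write))
```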

In [None]:
#sample list of text
textLines = ["Just do it.\n", "To infinity and beyond.\n", "To be or not to be.\n"]
textLines

In [None]:
#write the strings in the list to text file

with open(filepath+filename, 'w') as writefile:
    for line in textLines:
        print(line)
        writefile.write(line)

In [None]:
with open(filepath+filename, 'a') as writefile:
    for line in FileasList:
        print(line)
        writefile.write(line)

<b>Copy a file</b><p>
Let's copy the file data2.txt to the file data3.txt

In [None]:
#copy one file to another

with open('data2.txt','r') as readfile:
    with open('data3.txt','w') as writefile:
        for line in readfile:
            writefile.write(line)

In [None]:
with open('data3.txt','r') as testwritefile:
    print(testwritefile.read())
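As an alternative to the line-by-line loop, the standard library module <code>shutil</code> can copy a file in one call. This is only a sketch of that alternative, not part of the original exercise; the file names <code>source.txt</code> and <code>copy.txt</code> are assumptions.

```python
import os
import shutil
import tempfile

# Hypothetical scratch files
folder = tempfile.mkdtemp()
src = os.path.join(folder, "source.txt")
dst = os.path.join(folder, "copy.txt")

with open(src, "w") as f:
    f.write("Just do it.\nTo infinity and beyond.\n")

# Copy the content of src to dst (dst is overwritten if it exists)
shutil.copyfile(src, dst)

with open(dst, "r") as f:
    copied = f.read()

print(copied)
```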

Besides plain text, we can also write data into files and save them in different file formats such as **.txt, .csv, .xls (for Excel files), etc**. Let's take a look at some examples.

### 3. Pandas - Python data analysis library

Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.

In [1]:
#import pandas package
import pandas as pd

A <b>Series</b> is like a column in a table (a 1D array)

In [22]:
#create a series from a list. 
calories = [420,560,230]
aSer = pd.Series(calories)
aSer

0    420
1    560
2    230
dtype: int64

In [23]:
#add indices to the series
ind = ['day1', 'day2', 'day3']
aSer = pd.Series(calories,ind)
aSer

day1    420
day2    560
day3    230
dtype: int64

In [28]:
#creating a series from a dictionary
calories_dic = {'day1':420,'day2':560, 'day3':230}
aSer2 = pd.Series(calories_dic, index = ['day2', 'day3'])
aSer2

day2    560
day3    230
dtype: int64
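When a Series is built from a dictionary, the <code>index</code> argument selects keys by label. A key that is missing from the dictionary gives <code>NaN</code>, as this small sketch shows (the label <code>'day4'</code> is an assumption added for illustration):

```python
import pandas as pd

calories_dic = {'day1': 420, 'day2': 560, 'day3': 230}

# 'day4' is not a key of the dictionary, so its value becomes NaN
aSer3 = pd.Series(calories_dic, index=['day2', 'day4'])
print(aSer3)
```

Because <code>NaN</code> is a floating-point value, the dtype of the resulting Series is <code>float64</code> rather than <code>int64</code>.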

<b>DataFrames</b> - two-dimensional tables

In [44]:
#create a dataframe
data2D = {'mass':[1,2.54,3.12],'length':[1.3,2.2,5.3]}
df1=pd.DataFrame(data2D)

In [36]:
#display the first two rows of the dataframe
df1.head(2)

| | mass | length |
|---|---|---|
| 0 | 1.0 | 1.3 |
| 1 | 2.54 | 2.2 |


In [37]:
#select data in column 'mass'
df1['mass']

0    1.00
1    2.54
2    3.12
Name: mass, dtype: float64

In [38]:
#select (locate) data in the first row
df1.loc[0]

mass      1.0
length    1.3
Name: 0, dtype: float64

In [39]:
#example of boolean filtering: keep values smaller than 3, the others become NaN
df2 = df1[df1<3]
df2

| | mass | length |
|---|---|---|
| 0 | 1.0 | 1.3 |
| 1 | 2.54 | 2.2 |
| 2 | NaN | NaN |
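Filtering with a condition on the whole DataFrame keeps its shape and masks the non-matching values with <code>NaN</code>. To drop whole rows instead, we can put the condition on a single column. A short sketch using the same data:

```python
import pandas as pd

data2D = {'mass': [1, 2.54, 3.12], 'length': [1.3, 2.2, 5.3]}
df1 = pd.DataFrame(data2D)

# Condition on the whole table: same shape, NaN where the value is not < 3
masked = df1[df1 < 3]

# Condition on one column: only the rows with mass < 3 are kept
rows = df1[df1['mass'] < 3]

print(masked)
print(rows)
```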


In [45]:
#adding or changing names of the rows
df1 = pd.DataFrame(data2D,index = ["case1", "case2", "case3"])
df1
print(df1.loc["case1"])


mass      1.0
length    1.3
Name: case1, dtype: float64
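Once the rows have names, <code>.loc</code> selects by label, while <code>.iloc</code> still selects by integer position. A minimal sketch with the same data:

```python
import pandas as pd

data2D = {'mass': [1, 2.54, 3.12], 'length': [1.3, 2.2, 5.3]}
df1 = pd.DataFrame(data2D, index=["case1", "case2", "case3"])

by_label = df1.loc["case2"]              # row selected by its name
by_position = df1.iloc[1]                # the same row, selected by position
single_value = df1.loc["case3", "length"]  # one cell: row label, column name

print(by_label)
print(single_value)
```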


In [48]:
#load a CSV (comma-separated values) file into a Pandas DataFrame
import pandas as pd
df = pd.read_csv('metasurface_phaseMap.csv')
print(df)

      2.100000000000000000e+02  1.930000000000000000e+02  \
0                        193.0                     176.0   
1                        176.0                     159.0   
2                        159.0                     143.0   
3                        143.0                     126.0   
4                        126.0                     109.0   
...                        ...                       ...   
1556                     143.0                     126.0   
1557                     159.0                     143.0   
1558                     176.0                     159.0   
1559                     193.0                     176.0   
1560                     210.0                     193.0   

      1.760000000000000000e+02  1.590000000000000000e+02  \
0                        159.0                     143.0   
1                        143.0                     126.0   
2                        126.0                     109.0   
3                        109.0         
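The column labels in the output above are numbers, which suggests that the file has no header row and that <code>read_csv</code> used the first row of data as column names. Assuming that is the case, passing <code>header=None</code> keeps that row as data. The sketch below uses a small made-up CSV text instead of the real file:

```python
import io
import pandas as pd

# Made-up CSV content without a header row (stands in for the real file)
csv_text = "210,193,176\n193,176,159\n176,159,143\n"

# Default: the first row is consumed as column names
df_default = pd.read_csv(io.StringIO(csv_text))

# header=None: all rows are data, columns are numbered 0, 1, 2, ...
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(df_default.shape)
print(df_no_header.shape)
```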

In [49]:
#display the maximum number of rows that Pandas prints
print(pd.options.display.max_rows)

60


In [50]:
pd.options.display.max_rows = 9999
print(df)

      2.100000000000000000e+02  1.930000000000000000e+02  \
0                        193.0                     176.0   
1                        176.0                     159.0   
2                        159.0                     143.0   
3                        143.0                     126.0   
4                        126.0                     109.0   
5                        109.0                      93.0   
6                         93.0                      76.0   
7                         76.0                      59.0   
8                         60.0                      43.0   
9                         43.0                      26.0   
10                        26.0                      10.0   
11                        10.0                     353.0   
12                       353.0                     337.0   
13                       337.0                     320.0   
14                       320.0                     304.0   
15                       304.0          
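Assigning to <code>pd.options.display.max_rows</code> changes the setting for the rest of the session. If the larger limit is needed only once, <code>pd.option_context</code> can change it temporarily, as in this sketch:

```python
import pandas as pd

default_rows = pd.get_option("display.max_rows")

# The setting is changed only inside the with block
with pd.option_context("display.max_rows", 9999):
    inside_rows = pd.get_option("display.max_rows")

# After the block the previous value is restored
after_rows = pd.get_option("display.max_rows")

print(default_rows, inside_rows, after_rows)
```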

<b>JSON</b> (JavaScript Object Notation): big data sets are often stored or extracted as JSON

In [51]:
#load a json dataset
# https://www.kaggle.com/rtatman/iris-dataset-json-version/version/1
dfjs = pd.read_json('iris.json')

In [52]:
dfjs

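| | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |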
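To illustrate what <code>read_json</code> expects, here is a small sketch that reads a JSON text made of a list of records. The two records are made up, and the assumption that iris.json has the same list-of-records layout is not confirmed by this notebook.

```python
import io
import pandas as pd

# Made-up JSON text: a list of records, one dictionary per row
json_text = (
    '[{"sepalLength": 5.1, "species": "setosa"},'
    ' {"sepalLength": 4.9, "species": "setosa"}]'
)

# Each record becomes a row, each key becomes a column
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```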


<b>Displaying the data</b>: head(), tail(), info()


In [55]:
#test the displaying methods (info() prints its report; of the other two, a notebook shows only the last expression, head())
dfjs.info()
dfjs.tail()
dfjs.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepalLength    150 non-null float64
sepalWidth     150 non-null float64
petalLength    150 non-null float64
petalWidth     150 non-null float64
species        150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


| | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
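A notebook cell displays only the value of its last expression, so to see the results of <code>head()</code> and <code>tail()</code> in the same cell we can print them explicitly. The sketch below uses a small made-up DataFrame; note that <code>info()</code> prints its report and returns <code>None</code>.

```python
import io
import pandas as pd

# Made-up DataFrame with ten rows
df_demo = pd.DataFrame({"x": range(10)})

# info() writes a text report; here it is captured in a buffer
buffer = io.StringIO()
result = df_demo.info(buf=buffer)
info_text = buffer.getvalue()

print(info_text)
print(df_demo.head(3))   # first three rows
print(df_demo.tail(3))   # last three rows
```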


<b>Data Cleaning</b> - fixing bad data, i.e. empty cells, duplicates, wrong data, etc.

In [74]:
#HW: find out the flaws in data and fix them (#)
dfhw = pd.read_csv('hw_dataset.csv')
#hints
#remove empty cells - .dropna(inplace=True)
#remove rows with a null value in the column - .dropna(subset = [...], inplace = True)
#replace empty cells (e.g. nulls to 5) - .fillna(5, inplace = True), you can specify the row/column
#replace with a mean (mean()/median()) value of that column
#fix the date format - pd.to_datetime(df[...])
#replacing values - .loc[number,'column'] = ...
#duplicates - df.duplicated(); drop_duplicates(inplace=True)


In [75]:
dfhw

| | Duration | Date | Pulse | Maxpulse | Calories |
|---|---|---|---|---|---|
| 0 | 60 | '2020/12/01' | 110 | 130 | 409.1 |
| 1 | 60 | '2020/12/02' | 117 | 145 | 479.0 |
| 2 | 60 | '2020/12/03' | 103 | 135 | 340.0 |
| 3 | 45 | '2020/12/04' | 109 | 175 | 282.4 |
| 4 | 45 | '2020/12/05' | 117 | 148 | 406.0 |
| 5 | 60 | '2020/12/06' | 102 | 127 | 300.0 |
| 6 | 60 | '2020/12/07' | 110 | 136 | 374.0 |
| 7 | 450 | '2020/12/08' | 104 | 134 | 253.3 |
| 8 | 30 | '2020/12/09' | 109 | 133 | 195.1 |
| 9 | 60 | '2020/12/10' | 98 | 124 | 269.0 |
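One possible way to apply the hints is sketched below on a small made-up DataFrame, not on hw_dataset.csv itself. The values, the threshold of 120 minutes, and the choice to replace the suspicious duration with 45 are all assumptions for illustration; the right fixes depend on the actual data.

```python
import pandas as pd

# Made-up data with typical flaws: empty cells, a wrong value, a duplicate row
raw = pd.DataFrame({
    "Duration": [60, 60, 450, 45, 45],
    "Date": ["2020/12/01", "2020/12/02", "2020/12/08", None, "2020/12/05"],
    "Calories": [409.1, None, 253.3, 282.4, 406.0],
})
raw = pd.concat([raw, raw.iloc[[0]]], ignore_index=True)  # add a duplicate of row 0

clean = raw.drop_duplicates()                      # remove duplicate rows
clean = clean.dropna(subset=["Date"]).copy()       # drop rows without a date
# Replace empty calorie cells with the mean of the column
clean["Calories"] = clean["Calories"].fillna(clean["Calories"].mean())
# Convert the date strings to a datetime column
clean["Date"] = pd.to_datetime(clean["Date"], format="%Y/%m/%d")
# Assumption: a duration above 120 minutes is a typo and should be 45
clean.loc[clean["Duration"] > 120, "Duration"] = 45

print(clean)
```

Working on a copy and reassigning the result, instead of using <code>inplace=True</code>, keeps the raw data available for comparison.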
