## Fundamentals of Python - part 4 

In this lecture you will learn how to use Python to work with data: reading and writing files, and loading, processing, and saving data with the pandas library.

## References
Mark Lutz, 'Learning Python: Powerful Object-Oriented Programming', O'Reilly Media, Inc., 2013. 

Dane Hillard, 'Practices of the Python Pro', Manning Publications, 2020.

Python Basics for Data Science (IBM PY0101EN)

w3schools.com/python/pandas

### 1. Reading files with the 'open' function

One way to read or write a file in Python is to use the built-in <code>open</code> function. It returns a <b>file object</b> whose methods and attributes let you read, write, and manipulate the file. In this notebook, we will only cover <b>.txt</b> files. The first argument is the path to the file, including the file name.

In [None]:
# open the data1.txt



The mode argument is optional and the default value is <b>r</b>. In this notebook we use the following modes: 
<ul>
    <li><b>r</b> Read mode for reading files </li>
    <li><b>w</b> Write mode for writing files</li>
    <li><b>a</b> Append mode for adding to the end of an existing file</li>
</ul>
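A minimal sketch of opening a file, inspecting its attributes, reading it, and closing it. The real contents of data1.txt are not shown in this notebook, so the example first creates a stand-in file with made-up text in a temporary folder; the variable names are illustrative.

```python
import os
import tempfile

# Setup only: create a stand-in data1.txt with made-up contents
# (an assumption, since the real file is not shown here).
path = os.path.join(tempfile.mkdtemp(), "data1.txt")
with open(path, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3\n")

file1 = open(path, "r")   # "r" is the default mode and could be omitted
print(file1.name)         # the path that was passed to open()
print(file1.mode)         # r

content = file1.read()    # read the whole file into a single string
print(content)

file1.close()             # a file opened this way must be closed by hand
print(file1.closed)       # True
```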

In [None]:
#attributes .name;.mode


In [None]:
#assign a variable


In [None]:
#close file after finishing the work


<b>A better way to open the file</b><p>
Using the <code>with</code> statement is better practice because it automatically closes the file, even if the code raises an exception. Python runs everything in the indented block and then closes the file object. 
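The following sketch shows the <code>with</code> pattern on a stand-in file. The file contents are made up for illustration, and the attribute <code>.closed</code> is used to check the state of the file object inside and outside the block.

```python
import os
import tempfile

# Setup only: stand-in data1.txt with made-up contents (an assumption).
path = os.path.join(tempfile.mkdtemp(), "data1.txt")
with open(path, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3\n")

with open(path, "r") as file1:
    content = file1.read()
    inside = file1.closed     # False: the file is still open in the block

outside = file1.closed        # True: closed automatically after the block
print(inside, outside)
print(content)
```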

Once the block has finished, the file object is closed. You can verify this by running the following cell:

In [None]:
#verify if the file is closed



We don’t have to read the entire file. For example, we can read the first 4 characters by passing 4 as an argument to the method **.read()**:

In [None]:
# read the first four characters


Once the method <code>.read(4)</code> is called, the first 4 characters are read. If we call the method again, the next 4 characters are read. The output of the following cell demonstrates this for different arguments to the method <code>read()</code>:
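A small sketch of how successive calls to <code>read(n)</code> move through a file. The file contents are made up for illustration; each call continues where the previous one stopped.

```python
import os
import tempfile

# Setup only: stand-in data1.txt with made-up contents (an assumption).
path = os.path.join(tempfile.mkdtemp(), "data1.txt")
with open(path, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\n")

with open(path, "r") as file1:
    first = file1.read(4)    # first 4 characters
    second = file1.read(4)   # the next 4 characters
    third = file1.read(7)    # the next 7 characters, up to the end of line 1
    rest = file1.read()      # no argument: everything that is left

print(repr(first))   # 'This'
print(repr(second))  # ' is '
print(repr(third))   # 'line 1\n'
print(repr(rest))    # 'This is line 2\n'
```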

In [None]:
#read certain amount of characters



We can also read one line of the file at a time using the method <code>readline()</code>: 
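As a sketch, <code>readline()</code> can be called repeatedly to step through a file one line at a time. The file contents below are made up; note that each returned line keeps its trailing newline, and an empty string signals the end of the file.

```python
import os
import tempfile

# Setup only: stand-in data1.txt with made-up contents (an assumption).
path = os.path.join(tempfile.mkdtemp(), "data1.txt")
with open(path, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\n")

with open(path, "r") as file1:
    line1 = file1.readline()   # first line, including its "\n"
    line2 = file1.readline()   # second line
    at_end = file1.readline()  # nothing left, so an empty string

print(repr(line1))   # 'This is line 1\n'
print(repr(line2))   # 'This is line 2\n'
print(repr(at_end))  # ''
```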

We can use a loop to iterate through each line: 

In [None]:
#Exercise: iterate through the lines by using for loop


We can use the method <code>readlines()</code> to read all the lines of the text file into a list: 
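A short sketch of <code>readlines()</code> on a stand-in file with made-up contents. Each list element is one line of the file, still ending in its newline character.

```python
import os
import tempfile

# Setup only: stand-in data1.txt with made-up contents (an assumption).
path = os.path.join(tempfile.mkdtemp(), "data1.txt")
with open(path, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3\n")

with open(path, "r") as file1:
    lines = file1.readlines()   # a list with one string per line

print(len(lines))       # 3
print(repr(lines[0]))   # 'This is line 1\n'
```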

### 2. Writing files with the 'open' function

We can use the file object's method <code>write()</code> to save text to a file. To write, the mode argument must be set to <b>w</b>. Let’s write a file <b>data2.txt</b> containing one line of text.
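A minimal sketch of writing one line and reading it back. The file is placed in a temporary folder and the text is made up for illustration. Keep in mind that opening an existing file in mode <b>w</b> discards its previous contents.

```python
import os
import tempfile

# The temporary folder is an assumption made to keep the example tidy.
path = os.path.join(tempfile.mkdtemp(), "data2.txt")

with open(path, "w") as file2:
    file2.write("This is line A\n")   # "w" creates the file or overwrites it

with open(path, "r") as file2:
    content = file2.read()

print(repr(content))   # 'This is line A\n'
```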

In [None]:
#write line to the file


In [None]:
#read the file


In [None]:
#write several lines to file


The method <code>.write()</code> is the counterpart of <code>.readline()</code>: instead of reading a line, it writes the string you pass to it. It does not add a newline character on its own, so end each line with <code>\n</code> yourself.
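The sketch below writes several lines with repeated calls to <code>write()</code>. The text is made up; the last two calls show that the output only moves to a new line where the string itself contains a newline character.

```python
import os
import tempfile

# The temporary folder is an assumption made to keep the example tidy.
path = os.path.join(tempfile.mkdtemp(), "data2.txt")

with open(path, "w") as file2:
    file2.write("This is line A\n")
    file2.write("This is line B\n")
    file2.write("This is ")    # no newline here...
    file2.write("line C\n")    # ...so this continues the same line

with open(path, "r") as file2:
    content = file2.read()

print(content)
```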

In [None]:
#verify the content written to the file


By setting the mode argument to append (**a**), you can add a new line to the end of an existing file as follows:
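A sketch of append mode with made-up text: the first block creates the file, the second adds one line to its end, and the third appends every string from a list using a loop.

```python
import os
import tempfile

# The temporary folder is an assumption made to keep the example tidy.
path = os.path.join(tempfile.mkdtemp(), "data2.txt")

with open(path, "w") as file2:
    file2.write("This is line A\n")

with open(path, "a") as file2:        # "a" keeps the existing contents
    file2.write("This is line B\n")

lines_to_add = ["This is line C\n", "This is line D\n"]
with open(path, "a") as file2:
    for line in lines_to_add:         # write each string in the list
        file2.write(line)

with open(path, "r") as file2:
    all_lines = file2.read().splitlines()

print(all_lines)
```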

In [None]:
#write a new line to text file



In [None]:
#sample list of text


In [None]:
#write the strings in the list to text file



<b>Copy a file</b><p>
Let's copy the contents of the file data2.txt into the file data3.txt.
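One possible way to do this, sketched with made-up contents for data2.txt, is to open both files in a single <code>with</code> statement and write each line of the source to the destination.

```python
import os
import tempfile

# Setup only: stand-in data2.txt with made-up contents (an assumption).
folder = tempfile.mkdtemp()
source_path = os.path.join(folder, "data2.txt")
target_path = os.path.join(folder, "data3.txt")
with open(source_path, "w") as setup_file:
    setup_file.write("This is line A\nThis is line B\n")

# Read from one file and write to the other, line by line.
with open(source_path, "r") as readfile, open(target_path, "w") as writefile:
    for line in readfile:
        writefile.write(line)

with open(target_path, "r") as check_file:
    copied = check_file.read()

print(copied)
```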

In [None]:
#copy one file to another



After reading files, we can also write data into files and save them in different file formats such as **.txt**, **.csv**, or **.xls** (for Excel files). Let's take a look at some examples.

### 3. Pandas - Python data analysis library

Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.

In [None]:
#import pandas package


A <b>Series</b> is like a single column of a table (a one-dimensional labeled array).
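A brief sketch of three ways to build a Series: from a list, from a list with custom index labels, and from a dictionary. All values and labels are made up for illustration.

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
s = pd.Series([1, 7, 2])
print(s[0])              # 1

# From a list with custom index labels
s_labeled = pd.Series([1, 7, 2], index=["x", "y", "z"])
print(s_labeled["y"])    # 7

# From a dictionary: the keys become the index labels
calories = {"day1": 420, "day2": 380, "day3": 390}
s_dict = pd.Series(calories)
print(s_dict["day2"])    # 380
```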

In [None]:
#create a series from a list. 


In [None]:
#add indices to the series


In [None]:
#creating a series from a dictionary


<b>DataFrames</b> are two-dimensional tables.
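A minimal sketch of creating a DataFrame and selecting data from it. The column name <code>mass</code> follows the cells below; the <code>height</code> column, the values, and the row names are made up for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column (values are made up).
data = {"mass": [60, 72, 85], "height": [165, 178, 190]}
df = pd.DataFrame(data)

mass_column = df["mass"]       # select one column (a Series)
first_row = df.loc[0]          # select the row with index label 0
mean_mass = df["mass"].mean()  # one of many built-in methods

# Give the rows names instead of the default 0, 1, 2
df_named = df.copy()
df_named.index = ["person1", "person2", "person3"]
print(df_named.loc["person2"])
```

Calling <code>df.plot()</code> would draw the columns as a line chart, provided matplotlib is installed.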

In [None]:
#create a dataframe


In [None]:
#plot the dataframe


In [None]:
#select data in column 'mass'


In [None]:
#select (locate) data in the first row


In [None]:
#examples of pandas methods


In [None]:
#adding or changing names of the rows



In [None]:
#load a CSV (comma-separated values) file into a pandas DataFrame


In [None]:
#display the maximum number of rows that pandas prints (pd.options.display.max_rows)


<b>JSON</b> (JavaScript Object Notation): large data sets are often stored or exchanged as JSON.
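A small sketch of loading JSON with <code>pd.read_json</code>. The records below are made up and only loosely resemble an iris-style data set; they are not taken from the Kaggle file. The text is wrapped in <code>StringIO</code> so the example needs no file on disk, but a file path can be passed to <code>pd.read_json</code> instead.

```python
from io import StringIO

import pandas as pd

# Made-up JSON records (an assumption, not the real data set).
json_text = """[
  {"sepalLength": 5.1, "sepalWidth": 3.5, "species": "setosa"},
  {"sepalLength": 7.0, "sepalWidth": 3.2, "species": "versicolor"},
  {"sepalLength": 6.3, "sepalWidth": 3.3, "species": "virginica"}
]"""

# Each record becomes a row and each key becomes a column.
df_json = pd.read_json(StringIO(json_text))
print(df_json)
```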

In [None]:
#load a json dataset
# https://www.kaggle.com/rtatman/iris-dataset-json-version/version/1


<b>Displaying the data</b>: head(), tail(), info()
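The sketch below saves a small made-up table to a CSV file, loads it back with <code>pd.read_csv</code>, and then tries the display methods. The file name and the column names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Build a made-up table with 10 rows and save it as CSV.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
original = pd.DataFrame({"id": range(1, 11),
                         "value": [x * 1.5 for x in range(1, 11)]})
original.to_csv(path, index=False)   # index=False: do not save row labels

loaded = pd.read_csv(path)           # load the CSV into a DataFrame

print(pd.options.display.max_rows)   # maximum number of rows pandas prints
print(loaded.head())                 # first 5 rows by default
print(loaded.head(3))                # first 3 rows
print(loaded.tail(2))                # last 2 rows
loaded.info()                        # column names, types, non-null counts
```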


In [None]:
#test the displaying methods


<b>Data Cleaning</b> - fixing bad data, such as empty cells, duplicates, and wrong values
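A hedged sketch of a typical cleaning sequence on a tiny made-up table (not the homework data set). The column names, the values, and the decision to treat a pulse of 450 as a typo for 120 are all assumptions made for illustration.

```python
import pandas as pd

# Made-up data with a duplicate row, a missing date, a missing pulse,
# and one implausible value.
df = pd.DataFrame({
    "date": ["2020-12-01", "2020-12-02", "2020-12-02",
             None, "2020-12-04", "2020-12-05"],
    "pulse": [110, 117, 117, 103, None, 450],
})

# Drop exact duplicate rows, then rows that have no date.
clean = df.drop_duplicates().dropna(subset=["date"]).copy()

# Replace the wrong value by hand (assumed correction).
clean.loc[5, "pulse"] = 120

# Fill the remaining empty pulse cell with the column median.
clean["pulse"] = clean["pulse"].fillna(clean["pulse"].median())

# Convert the date strings to real datetime values.
clean["date"] = pd.to_datetime(clean["date"])

print(clean)
```

This version reassigns the result of each step instead of using <code>inplace=True</code>; both styles appear in pandas code.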

In [None]:
#HW: find out the flaws in data and fix them (#)
dfhw = pd.read_csv('hw_dataset.csv')
#hints
#remove rows with empty cells - df.dropna(inplace=True)
#remove rows with a null value in given columns - df.dropna(subset=[...], inplace=True)
#replace empty cells (e.g. nulls with 5) - df.fillna(5, inplace=True); you can also do this for a single column
#replace empty cells with the mean()/median() value of that column
#fix the date format - pd.to_datetime(df[...])
#replace values - df.loc[row_label, 'column'] = ...
#duplicates - df.duplicated(); df.drop_duplicates(inplace=True)
