# Reading and Analyzing CSV Data

This script provides examples of reading and analyzing comma separated values (CSV) files.

- Created by: Tomer Burg
- Last updated: 22 March 2022

Let's start with importing the packages we'll need. If you don't have these packages, make sure to install them - they're highly useful!

In [1]:
import os
import numpy as np
import pandas as pd

## Method 1: Pure python

Provided in this directory is a `temperature.csv` file, containing hypothetical temperatures for a whole month. This file has 3 columns: Day, Max, Min. The latter two represent the daily maximum and minimum temperatures in degrees Fahrenheit.

Python provides a built-in `open()` function, which is used to read and write files. We'll use 2 of its arguments here:

- File name (relative or full path)
- Flag indicating what to do with the file:
  - "r" = read the file
  - "a" = append to the file (write to the file starting with the end of the file's content)
  - "w" = write to the file (overwriting its previous content)

The output is a file object (a handle to the open file), which we'll store in a variable `f`.

In [2]:
#Open "temperature.csv" in read mode and store the resulting file object in the variable "f".
f = open("temperature.csv","r")

This next block of code reads all the lines within the file, and stores them in a variable `content`. We'll also close the file, as we now have all the information we need from it.

In [3]:
content = f.readlines()
f.close()
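
As an alternative to calling `close()` by hand, Python's `with` statement closes the file automatically once the indented block ends. Below is a minimal sketch of that pattern. The file name `temperature_sample.csv` and its two data rows are hypothetical stand-ins, written to a temporary directory so the example runs anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for temperature.csv, written to a temporary directory
sample_path = os.path.join(tempfile.mkdtemp(), "temperature_sample.csv")
with open(sample_path, "w") as f:
    f.write("Day,Max,Min\n1,57,32\n2,65,33\n")

# The file is closed automatically when the "with" block ends
with open(sample_path, "r") as f:
    sample_content = f.readlines()

print(sample_content)
print(f.closed)
```

The second `print` shows that the file is already closed, even though `f.close()` was never called explicitly.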

`content` is now a list containing all lines within the file. Note that they're by default read in as strings.

In [4]:
print(content)

['Day,Max,Min\n', '1,57,32\n', '2,65,33\n', '3,74,52\n', '4,76,38\n', '5,40,22\n', '6,28,12\n', '7,26,8\n', '8,31,-2\n', '9,35,2\n', '10,28,12\n', '11,26,5\n', '12,22,-6\n', '13,31,-1\n', '14,36,9\n', '15,49,22\n', '16,55,32\n', '17,53,48\n', '18,56,27\n', '19,39,22\n', '20,35,24\n', '21,41,26\n', '22,38,28\n', '23,33,25\n', '24,35,22\n', '25,44,19\n', '26,52,29\n', '27,61,41\n', '28,64,52']


Notice that every entry except the last ends in a line break `\n` character. We can use a list comprehension to remove this:

In [5]:
content = [i.split("\n")[0] for i in content]
print(content)

['Day,Max,Min', '1,57,32', '2,65,33', '3,74,52', '4,76,38', '5,40,22', '6,28,12', '7,26,8', '8,31,-2', '9,35,2', '10,28,12', '11,26,5', '12,22,-6', '13,31,-1', '14,36,9', '15,49,22', '16,55,32', '17,53,48', '18,56,27', '19,39,22', '20,35,24', '21,41,26', '22,38,28', '23,33,25', '24,35,22', '25,44,19', '26,52,29', '27,61,41', '28,64,52']
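
Another common way to drop the trailing line break is the string method `rstrip`. The sketch below uses a shortened, hypothetical copy of the list (`raw_lines`) rather than the full file contents.

```python
# Shortened, hypothetical copy of the list returned by readlines()
raw_lines = ['Day,Max,Min\n', '1,57,32\n', '28,64,52']

# rstrip("\n") removes a trailing line break if present, and leaves other lines unchanged
clean_lines = [line.rstrip("\n") for line in raw_lines]
print(clean_lines)
```

Lines without a trailing `\n`, such as the last one here, pass through untouched.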


Let's start by organizing the data in a dictionary. To create the keys for the dictionary, we'll look at the first element of `content`, as this contains the column names.

In [6]:
#Define an empty dictionary to store the data in
data = {}

#Create an empty list for every column, with its key corresponding to the column name
column_names = content[0].split(",")
for key in column_names:
    data[key] = []

print(data)

{'Day': [], 'Max': [], 'Min': []}


We're now ready to populate our dictionary with data from the file! For this, we'll iterate over every line of content after the first line (since that was the header line).

As we iterate over every line, we'll split it into a comma-separated list, then iterate over the column names using Python's `enumerate` function, matching each column's entry with its corresponding column name and dictionary entry. Since the entries are all strings, we'll also convert them to integers.

In [7]:
for line in content[1:]:
    line_array = line.split(",")
    for idx,key in enumerate(column_names):
        data[key].append(int(line_array[idx]))

print(data)

{'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], 'Max': [57, 65, 74, 76, 40, 28, 26, 31, 35, 28, 26, 22, 31, 36, 49, 55, 53, 56, 39, 35, 41, 38, 33, 35, 44, 52, 61, 64], 'Min': [32, 33, 52, 38, 22, 12, 8, -2, 2, 12, 5, -6, -1, 9, 22, 32, 48, 27, 22, 24, 26, 28, 25, 22, 19, 29, 41, 52]}


The above loop may look complicated, but it's essentially a condensed way of doing the following:

In [8]:
data_2 = {
    'Day': [],
    'Max': [],
    'Min': [],
}

for line in content[1:]:
    line_array = line.split(",")
    
    line_day = int(line_array[0])
    line_max = int(line_array[1])
    line_min = int(line_array[2])
    
    data_2['Day'].append(line_day)
    data_2['Max'].append(line_max)
    data_2['Min'].append(line_min)
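
For comparison, Python's standard library also ships a `csv` module that handles the header line and the comma splitting. The sketch below is one possible way to build the same kind of dictionary with `csv.DictReader`. It reads from a short, hypothetical in-memory string (`sample_text`) instead of the real file, and the name `csv_data` is introduced only for illustration.

```python
import csv
import io

# Short, hypothetical stand-in for the contents of temperature.csv
sample_text = "Day,Max,Min\n1,57,32\n2,65,33\n5,40,22\n"

# DictReader uses the first line as column names
reader = csv.DictReader(io.StringIO(sample_text))

# One empty list per column, then fill them row by row
csv_data = {name: [] for name in reader.fieldnames}
for row in reader:
    for name in reader.fieldnames:
        csv_data[name].append(int(row[name]))

print(csv_data)
```

To read an actual file, the `io.StringIO(...)` object would be replaced with a file opened via `open()`.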

Let's say we want to look at the maximum and minimum temperatures on the 5th of the month. This is how we would do it:

In [9]:
idx = data['Day'].index(5)
max_temp = data['Max'][idx]
min_temp = data['Min'][idx]
print(f"Max: {max_temp}, Min: {min_temp}")

Max: 40, Min: 22
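
If we expect to look up many days, one option is to reorganize the columns into a dictionary keyed by day. The sketch below assumes a small, hypothetical dictionary `sample` in the same layout as `data`. The name `by_day` is also introduced here for illustration.

```python
# Small, hypothetical dictionary in the same layout as "data"
sample = {'Day': [4, 5, 6], 'Max': [76, 40, 28], 'Min': [38, 22, 12]}

# zip() pairs up the entries of each column, row by row
by_day = {
    day: (max_t, min_t)
    for day, max_t, min_t in zip(sample['Day'], sample['Max'], sample['Min'])
}

max_temp, min_temp = by_day[5]
print(f"Max: {max_temp}, Min: {min_temp}")
```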


## Method 2: Read with Numpy

Numpy also provides functionality to read CSV files. A major caveat, however, is that Numpy arrays require all elements to be of the same type, so with its default settings `np.loadtxt` cannot parse a file with string headers followed by integer/float entries.

For this purpose, a file without headers `temperature_noheader.csv` is provided in this directory as well.

In [10]:
data = np.loadtxt("temperature_noheader.csv",delimiter=",")

Let's look at what the file contains:

In [11]:
print(data)

[[ 1. 57. 32.]
 [ 2. 65. 33.]
 [ 3. 74. 52.]
 [ 4. 76. 38.]
 [ 5. 40. 22.]
 [ 6. 28. 12.]
 [ 7. 26.  8.]
 [ 8. 31. -2.]
 [ 9. 35.  2.]
 [10. 28. 12.]
 [11. 26.  5.]
 [12. 22. -6.]
 [13. 31. -1.]
 [14. 36.  9.]
 [15. 49. 22.]
 [16. 55. 32.]
 [17. 53. 48.]
 [18. 56. 27.]
 [19. 39. 22.]
 [20. 35. 24.]
 [21. 41. 26.]
 [22. 38. 28.]
 [23. 33. 25.]
 [24. 35. 22.]
 [25. 44. 19.]
 [26. 52. 29.]
 [27. 61. 41.]
 [28. 64. 52.]]


We now have the data stored in a 2D Numpy array as floats. The drawback is that we don't have access to headers, so we need to know beforehand what each column and row represents.

Let's repeat the previous exercise and get the maximum and minimum temperatures on the 5th of the month. First, we need to find the row for the 5th day of the month, by taking the first column of the 2D array above and finding the index where its value equals 5.
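
As a side note, `np.loadtxt` accepts a `skiprows` argument, so a file with a header line can still be read by skipping that line. The sketch below assumes a short, hypothetical in-memory string (`sample_text`) in place of the real file, and also passes `dtype=int` to get integers instead of floats.

```python
import io
import numpy as np

# Short, hypothetical stand-in for the contents of temperature.csv
sample_text = "Day,Max,Min\n1,57,32\n2,65,33\n5,40,22\n"

# skiprows=1 skips the header line; dtype=int stores the values as integers
sample = np.loadtxt(io.StringIO(sample_text), delimiter=",", skiprows=1, dtype=int)
print(sample)
```

The header is discarded rather than stored, so we still need to remember which column is which.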

In [12]:
first_column = data[:,0]
idx = np.where(first_column == 5)[0][0]
max_temp = data[idx,1]
min_temp = data[idx,2]
print(f"Max: {max_temp}, Min: {min_temp}")

Max: 40.0, Min: 22.0
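
The same lookup can also be written with a boolean mask, which selects every row where the condition is true. This is a minimal sketch on a small, hypothetical array (`sample`) with the same column order: day, max, min.

```python
import numpy as np

# Small, hypothetical array with columns: day, max, min
sample = np.array([[4., 76., 38.],
                   [5., 40., 22.],
                   [6., 28., 12.]])

# True for rows whose first column equals 5
mask = sample[:, 0] == 5

# sample[mask] is a 2D array of matching rows; take the first one
day, max_temp, min_temp = sample[mask][0]
print(f"Max: {max_temp}, Min: {min_temp}")
```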


## Method 3: Read CSV with Pandas

Pandas is a highly useful Python package with many methods to analyze data. Pandas stores data in `DataFrame` objects, which, unlike plain Numpy arrays, carry column headers and row labels.

Let's read the original temperature file into Pandas:

In [13]:
df = pd.read_csv("temperature.csv",delimiter=',')

Now let's look at what `df` contains:

In [14]:
print(df)

    Day  Max  Min
0     1   57   32
1     2   65   33
2     3   74   52
3     4   76   38
4     5   40   22
5     6   28   12
6     7   26    8
7     8   31   -2
8     9   35    2
9    10   28   12
10   11   26    5
11   12   22   -6
12   13   31   -1
13   14   36    9
14   15   49   22
15   16   55   32
16   17   53   48
17   18   56   27
18   19   39   22
19   20   35   24
20   21   41   26
21   22   38   28
22   23   33   25
23   24   35   22
24   25   44   19
25   26   52   29
26   27   61   41
27   28   64   52


Once again, we'll retrieve the max and min temperature on the 5th day of the month.

In [15]:
row = df.loc[df['Day'] == 5]
max_temp = row['Max'].values[0]
min_temp = row['Min'].values[0]
print(f"Max: {max_temp}, Min: {min_temp}")

Max: 40, Min: 22
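
If the day is what we usually look rows up by, one option is to make the `Day` column the index of the DataFrame with `set_index`, after which `.loc` can take the day directly. The sketch below assumes a short, hypothetical in-memory string (`sample_text`) in place of the real file.

```python
import io
import pandas as pd

# Short, hypothetical stand-in for the contents of temperature.csv
sample_text = "Day,Max,Min\n4,76,38\n5,40,22\n6,28,12\n"

# Use the "Day" column as the row labels
sample_df = pd.read_csv(io.StringIO(sample_text)).set_index("Day")

# .loc[row label, column name] returns a single value
max_temp = sample_df.loc[5, "Max"]
min_temp = sample_df.loc[5, "Min"]
print(f"Max: {max_temp}, Min: {min_temp}")
```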


We've now read in CSV files using 3 different methods! Refer to other scripts in this directory for ways to analyze this data now that we've read it in.