**Programming with Python - Day 1**

Carpentries Software workshop. **University of Twente**. November 14, 2024.

Adapted by **Dr. Rosa Aguilar** from the Software Carpentry **Programming with Python** material.

### Accessing multiple files

Sometimes we need to process several files at once. The *glob* library gives us a way to do that.

The *glob* library contains a function, also called glob, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character <b>*</b> matches zero or more characters, while <b>?</b> matches exactly one character. We can use this to get the names of all the CSV files in the current directory whose names start with "inflammation":
```
import glob
print(glob.glob('inflammation*.csv'))
```
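
To see how the two wildcards differ, here is a minimal sketch that runs in a scratch directory. The file names and the helper `matching_names` are invented for illustration and are not part of the workshop data.

```python
import glob
import os
import tempfile

# Create a scratch directory holding a few empty, made-up files.
scratch = tempfile.mkdtemp()
for name in ['inflammation-01.csv', 'inflammation-02.csv',
             'inflammation-10.csv', 'small-01.csv', 'notes.txt']:
    open(os.path.join(scratch, name), 'w').close()


def matching_names(pattern):
    """Return the sorted base names in the scratch directory that match pattern."""
    paths = glob.glob(os.path.join(scratch, pattern))
    return sorted(os.path.basename(path) for path in paths)


# '*' matches any run of characters, so all three inflammation files match.
star_matches = matching_names('inflammation*.csv')

# '?' matches exactly one character, so only the files numbered 0x match.
question_matches = matching_names('inflammation-0?.csv')

print(star_matches)
print(question_matches)
```

The first pattern picks up every file whose name starts with "inflammation" and ends with ".csv", while the second leaves out `inflammation-10.csv` because its digit after the hyphen is not `0`.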

In [None]:
# write the code here to print all the inflammation datasets
import glob
print(glob.glob('./swc-python/data/inflammation*.csv'))

As these examples show, glob.glob’s result is a list of file and directory paths in arbitrary order. This means we can loop over it to do something with each filename in turn. In our case, the “something” we want to do is generate a set of plots for each file in our inflammation dataset.

If we want to start by analyzing just the first three files in alphabetical order, we can use the sorted built-in function to generate a new sorted list from the glob.glob output:

```
import glob
import numpy
import matplotlib.pyplot

filenames = sorted(glob.glob('inflammation*.csv'))
filenames = filenames[0:3]
for filename in filenames:
    print(filename)

    data = numpy.loadtxt(fname=filename, delimiter=',')

    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)

    axes1.set_ylabel('average')
    axes1.plot(numpy.mean(data, axis=0))

    axes2.set_ylabel('max')
    axes2.plot(numpy.amax(data, axis=0))

    axes3.set_ylabel('min')
    axes3.plot(numpy.amin(data, axis=0))

    fig.tight_layout()
    matplotlib.pyplot.show()
```

In [None]:
# write and execute the code here

import glob
import numpy
import matplotlib.pyplot

filenames = sorted(glob.glob('./swc-python/data/inflammation*.csv'))
# filenames = filenames[0:3]
for filename in filenames:
    print(filename)

    data = numpy.loadtxt(fname=filename, delimiter=',')

    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)

    axes1.set_ylabel('average')
    axes1.plot(numpy.mean(data, axis=0))

    axes2.set_ylabel('max')
    axes2.plot(numpy.amax(data, axis=0))

    axes3.set_ylabel('min')
    axes3.plot(numpy.amin(data, axis=0))

    fig.tight_layout()
    matplotlib.pyplot.show()

What can be said about the third dataset?

Let's plot a heatmap of the third dataset.

In [None]:
# write here the code to read the file, create the plot, and display it.


**Insights**

We can see that there are zero values sporadically distributed across all patients and days of the clinical trial, suggesting that there were potential issues with data collection throughout the trial. In addition, we can see that the last patient in the study didn’t have any inflammation flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis!
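
Both observations can also be checked numerically rather than by eye. The sketch below uses a small made-up array in place of the real file (rows are patients, columns are days), so the numbers are illustrative only. With the workshop data you would load the third dataset with `numpy.loadtxt` instead.

```python
import numpy

# Toy stand-in for one dataset: rows are patients, columns are days.
# These values are invented for illustration only.
data = numpy.array([
    [0, 1, 3, 0, 2],
    [0, 2, 0, 4, 1],
    [0, 0, 0, 0, 0],   # a patient with no inflammation on any day
])

# Count the zero readings for each patient (sum across the day axis).
zeros_per_patient = numpy.sum(data == 0, axis=1)

# Find the patients whose maximum over all days is zero.
no_flare_ups = numpy.where(numpy.amax(data, axis=1) == 0)[0]

print(zeros_per_patient)
print(no_flare_ups)
```

A patient index appearing in `no_flare_ups` corresponds to a row that is zero on every day, which is the pattern described above for the last patient.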



**Exercise - Plotting Differences**

Plot the difference between the average inflammation reported in the first and second datasets (stored in inflammation-01.csv and inflammation-02.csv, respectively), i.e., the difference between the leftmost plots of the first two figures.
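
One possible way to approach the calculation is sketched below, assuming both files have already been loaded into arrays of the same shape. The helper name `mean_difference` and the toy arrays are hypothetical and only stand in for the real data.

```python
import numpy


def mean_difference(data_a, data_b):
    """Return the per-day average of data_a minus the per-day average of data_b."""
    return numpy.mean(data_a, axis=0) - numpy.mean(data_b, axis=0)


# Toy arrays (invented values): 2 patients by 3 days each.
first = numpy.array([[1.0, 2.0, 3.0],
                     [3.0, 4.0, 5.0]])
second = numpy.array([[0.0, 1.0, 2.0],
                      [2.0, 1.0, 0.0]])

difference = mean_difference(first, second)
print(difference)
```

With the workshop files, the two arrays would come from `numpy.loadtxt`, and the resulting one-dimensional array can be passed to `matplotlib.pyplot.plot` followed by `matplotlib.pyplot.show()`.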


In [None]:
# write your code here


**Exercise - optional**

Use each of the files once to generate a dataset containing values averaged over all patients by completing the code inside the loop given below:
```
import glob
import numpy

filenames = glob.glob('inflammation*.csv')
composite_data = numpy.zeros((60, 40))  # 60 patients (rows) by 40 days (columns)
for filename in filenames:
    # sum each new file's data into composite_data as it's read
    #
    pass  # placeholder so the skeleton runs; replace with your code
# and then divide the composite_data by number of samples
composite_data = composite_data / len(filenames)
```
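
The accumulate-then-divide pattern in this skeleton can be tried out on a much smaller case first. The following is a sketch only: the two `toy-*.csv` files and their values are made up, and the array shape is shrunk to 2 by 3 to match them.

```python
import glob
import os
import tempfile

import numpy

# Write two tiny made-up CSV files (2 patients by 3 days) to a scratch directory.
scratch = tempfile.mkdtemp()
numpy.savetxt(os.path.join(scratch, 'toy-01.csv'),
              numpy.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]), delimiter=',')
numpy.savetxt(os.path.join(scratch, 'toy-02.csv'),
              numpy.array([[3.0, 2.0, 1.0], [0.0, 1.0, 2.0]]), delimiter=',')

filenames = sorted(glob.glob(os.path.join(scratch, 'toy-*.csv')))
composite_data = numpy.zeros((2, 3))  # must match the shape of each file
for filename in filenames:
    # add this file's values element-wise to the running total
    composite_data = composite_data + numpy.loadtxt(fname=filename, delimiter=',')

# divide the running total by the number of files to get the element-wise mean
composite_data = composite_data / len(filenames)
print(composite_data)
```

Each entry of the result is the mean of the corresponding entries across the files, so the output keeps the same patients-by-days shape as a single file.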

In [None]:
# write the code here



**Insights**

After spending some time investigating the heat map and statistical plots, as well as doing the above exercises to plot differences between datasets and to generate composite patient statistics, we gain some insight into the clinical trial dataset.

In fact, it appears that all three of the “noisy” datasets (inflammation-03.csv, inflammation-08.csv, and inflammation-11.csv) are identical down to the last value. We confront the author about the suspicious data and duplicated files, and the author admits to fabricating the clinical data for their drug trial.
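
A claim like "identical down to the last value" can be verified directly with `numpy.array_equal`. The sketch below uses three small invented arrays in place of the loaded datasets, so it only illustrates the comparison itself.

```python
import numpy

# Toy arrays standing in for three loaded datasets (values invented).
data_a = numpy.array([[0.0, 1.0], [2.0, 0.0]])
data_b = numpy.array([[0.0, 1.0], [2.0, 0.0]])
data_c = numpy.array([[0.0, 1.0], [2.0, 3.0]])

# array_equal is True only when the shapes and every element match.
same_ab = numpy.array_equal(data_a, data_b)
same_ac = numpy.array_equal(data_a, data_c)

print(same_ab, same_ac)
```

With the workshop data, each array would be loaded from its CSV file with `numpy.loadtxt` before being compared.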

**Key points**
<ul>
    <li>
        Use glob.glob(pattern) to create a list of files whose names match a pattern.
    </li>
    <li>
        Use * in a pattern to match zero or more characters, and ? to match any single character.
    </li>
</ul>


