## Statistical analysis on the bacteria colonies

Start by reading the .csv file from the previous step.

Print the first few lines to see some of the data.

In [None]:
import pandas as pd

df = pd.read_csv("../data/bacteria_results_total.csv")
# Show the first five rows of the data frame
print(df.head())

### Calculate the mean area of the different colonies

We have already seen how to use the ```mean()``` function on a data frame. What do you get if you calculate the mean of the area now?

In [None]:
# Mean area over all rows, regardless of bacteria or drug
print(df['area'].mean())



This is probably not what we want: it is a single mean over every colony, whatever the bacteria or drug.
Right now we would like the mean area for each bacteria type and drug type.

We can do that by using a function called ```groupby(...)``` and giving it the column we want to group by.

Try grouping by "bacteria" and see what happens.


In [None]:
# group by bacteria and calculate the mean area for each bacteria type
df.groupby("bacteria")["area"].mean()

We are close now :)

It turns out ```groupby()``` is really nice. It lets us give a list of column names to group by, which is exactly what we want.

To create a list 'on the fly', put square brackets [ ] around your strings.

```groupby(['column1', 'column2'])```

In [None]:
# group by bacteria and drug and calculate the mean area for each combination
df.groupby(["bacteria","drug"])["area"].mean()

EUREKA!!

So ```groupby()``` gives us a nice way to compile statistics according to several variables.
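The result of grouping by two columns is a Series with a two-level index. Here is a minimal sketch of turning it back into a flat table. It uses a small made-up data frame called ```toy``` (the values are invented for illustration and are not from the experiment; only the column names come from the data above), and the column name ```mean_area``` is our own choice.

```python
import pandas as pd

# Made-up toy data, only for illustration (not the real measurements)
toy = pd.DataFrame({
    "bacteria": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "drug":     ["x", "x", "y", "y", "x", "x", "y", "y"],
    "area":     [10.0, 12.0, 20.0, 22.0, 30.0, 34.0, 40.0, 44.0],
})

# Mean area for each (bacteria, drug) combination:
# a Series whose index has two levels, "bacteria" and "drug"
mean_area = toy.groupby(["bacteria", "drug"])["area"].mean()

# reset_index() turns the index levels back into ordinary columns
mean_table = mean_area.reset_index(name="mean_area")
print(mean_table)
```

A flat table like this is often easier to save to a .csv file or to merge with other tables.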

We can also plot the values. For example, we can make histograms of the data we are interested in.

In [None]:
import matplotlib.pyplot as plt

data_area = df.groupby(["bacteria","drug"])["area"]

# Histogram of the area for each bacteria/drug group
data_area.plot(kind="hist", bins=30, title="Area Distribution")
plt.legend()
plt.show()

data_feret = df.groupby(["bacteria","drug"])["feret_diameter_max"]
data_feret.plot(kind="hist", bins=30, title="Feret Diameter Distribution")
plt.legend()
plt.show()
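If you want each histogram in the legend to carry the name of its group, one option is to loop over the groups yourself. This is only a sketch of one possible approach, again using a made-up ```toy``` data frame with invented values; the label format is our own choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy data, only for illustration (not the real measurements)
toy = pd.DataFrame({
    "bacteria": ["A"] * 6 + ["B"] * 6,
    "drug":     ["x", "x", "x", "y", "y", "y"] * 2,
    "area":     [10, 11, 13, 20, 22, 23, 30, 31, 35, 40, 42, 47],
})

fig, ax = plt.subplots()

# Iterating over a groupby gives (key, sub-frame) pairs;
# with two grouping columns the key is a (bacteria, drug) tuple
for (bacteria, drug), group in toy.groupby(["bacteria", "drug"]):
    ax.hist(group["area"], bins=5, alpha=0.5, label=f"{bacteria} / {drug}")

ax.set_title("Area Distribution")
ax.set_xlabel("area")
ax.legend()
plt.show()
```

Drawing everything on one ```ax``` keeps the groups on a shared axis, so the distributions can be compared directly.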



The function ```agg()``` lets us calculate several statistics at once and returns them in a new data frame.

In [None]:
agg_stats = df.groupby(["drug", "bacteria"]).agg({
    "area": ["mean", "std", "min", "max"]
})
print(agg_stats)

We can even aggregate several variables at once, each with its own list of statistics.

In [None]:
# Multiple aggregations
agg_stats = df.groupby(["bacteria","drug"]).agg({
    "feret_diameter_max": ["mean", "std", "min", "max"],
    "area": ["mean", "std"]
})
print(agg_stats)
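The dictionary form of ```agg()``` gives column names with two levels, such as ```("area", "mean")```. The sketch below shows how to pick out one of those columns, and how pandas' named aggregation gives flat column names instead. The ```toy``` data is made up for illustration, and the names ```area_mean``` and ```area_max``` are our own choice.

```python
import pandas as pd

# Made-up toy data, only for illustration (not the real measurements)
toy = pd.DataFrame({
    "bacteria": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "drug":     ["x", "x", "y", "y", "x", "x", "y", "y"],
    "area":     [10.0, 12.0, 20.0, 22.0, 30.0, 34.0, 40.0, 44.0],
})

# Dictionary form: each column is a (variable, statistic) pair
agg_stats = toy.groupby(["bacteria", "drug"]).agg({"area": ["mean", "max"]})
print(agg_stats[("area", "mean")])

# Named aggregation: new_column=(source_column, function) gives flat names
flat_stats = toy.groupby(["bacteria", "drug"]).agg(
    area_mean=("area", "mean"),
    area_max=("area", "max"),
)
print(flat_stats)
```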


### Pivot table
Another way to organise the data is with a pivot table. Pandas can of course do this as well, with a function called ```pivot_table```.

The parameters to the function, in this case, are:

```values```: the value we are interested in

```index```: the variable that will be the rows in the table

```columns```: the variable that will be the columns in the table

```aggfunc```: the functions we want to use to calculate the values in the table

In [None]:
pivot = df.pivot_table(
    values=["area"],
    columns=["drug"],
    index=["bacteria"],
    aggfunc=["mean"]
)
print(pivot)
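A pivot table with ```aggfunc="mean"``` holds the same numbers as the ```groupby()``` result from earlier, just laid out with one variable along the rows and the other along the columns. Here is a small sketch to compare the two, using made-up ```toy``` data (invented values, not the real measurements).

```python
import pandas as pd

# Made-up toy data, only for illustration (not the real measurements)
toy = pd.DataFrame({
    "bacteria": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "drug":     ["x", "x", "y", "y", "x", "x", "y", "y"],
    "area":     [10.0, 12.0, 20.0, 22.0, 30.0, 34.0, 40.0, 44.0],
})

# Pivot table: bacteria as rows, drug as columns, mean area in the cells
pivot = toy.pivot_table(values="area", index="bacteria", columns="drug", aggfunc="mean")

# Same numbers via groupby: unstack() moves the "drug" index level into the columns
unstacked = toy.groupby(["bacteria", "drug"])["area"].mean().unstack("drug")

print(pivot)
print(unstacked)
```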

In [None]:
# Scatter plot to see the relation between drug and area
df.plot(kind="scatter", x="drug", y="area", alpha=0.7)
plt.title("Area by Drug")
plt.show()