In this lab, we are going to practice *iterating* over our data, which means working through it one piece at a time. We'll look at three different files. For each file, we'll step through the data one year at a time and count the articles. Then we'll make a figure from those counts.

Let's get started!

In [None]:
from text_analytics import text_analytics
import os
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt


ai = text_analytics()
print("Done!")

Now we have three sets of articles to look at, from *Business Insider*, *Politico*, and *TechCrunch*. So here we define a list of filenames, and then we open each file in the list.

Once we've opened a file, we use *groupby* on the "Year" column to look at each year on its own. For now, we just print how many articles each year contains.

In [None]:
files = ["Wordclouds.Business_Insider.gz", "Wordclouds.Politico.gz", "Wordclouds.TechCrunch.gz"]
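for file in files:
    name = file
    # Build the full path without overwriting the loop variable
    path = os.path.join(ai.data_dir, file)
    df = pd.read_csv(path, index_col=0)

    # Each pass gives one year and the rows that belong to it
    for year, year_df in df.groupby("Year"):
        print(name, year, len(year_df))
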
print("Done!")
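
To see what the inner loop is doing without waiting for the big files to load, here is a minimal sketch on a tiny made-up table. The *toy* DataFrame and its "Title" column are assumptions for illustration only; the real files have their own columns.

```python
import pandas as pd

# Made-up rows standing in for articles; only the "Year" column matters here.
toy = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020, 2020],
    "Title": ["a", "b", "c", "d", "e"],
})

seen = []
for year, year_df in toy.groupby("Year"):
    # year is the group key; year_df holds only the rows for that year
    print(year, len(year_df))
    seen.append((int(year), len(year_df)))
```

Each pass through the loop hands us one year and a smaller DataFrame containing just that year's rows, so *len* gives the number of articles in that year.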

It takes a few minutes to read through each of these files (most of that time is spent loading them into memory). But this code gives us a year-by-year census of each dataset. Now, let's do it again, but this time we'll save the results and make a figure.

In [None]:
files = ["Wordclouds.Business_Insider.gz", "Wordclouds.Politico.gz", "Wordclouds.TechCrunch.gz"]
counts = []
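for file in files:
    # Strip the prefix and extension to get a readable dataset name
    name = file.replace("Wordclouds.", "").replace(".gz", "")
    path = os.path.join(ai.data_dir, file)
    df = pd.read_csv(path, index_col=0)

    # Save one row per (dataset, year) instead of printing it
    for year, year_df in df.groupby("Year"):
        counts.append([name, year, len(year_df)])
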
counts = pd.DataFrame(counts, columns = ["Dataset", "Year", "N. Articles"])
print(counts)
print("Done!")
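
As an aside, *pandas* can also produce the per-year counts without an explicit loop. The sketch below uses a made-up *toy* table (an assumption, not the lab data) to check that *groupby* followed by *size* matches what we get by iterating.

```python
import pandas as pd

# Made-up data: five articles across two years
toy = pd.DataFrame({"Year": [2019, 2019, 2020, 2020, 2020]})

# Count rows per year in one step
per_year = toy.groupby("Year").size()

# The same counts, gathered by iterating as in the lab
manual = {year: len(year_df) for year, year_df in toy.groupby("Year")}

print(per_year)
```

The loop is still worth practicing, because it lets us do anything we like with each year's rows, not just count them.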

That table holds our results: one row for each dataset and year, with the number of articles. Now we just plot it. This time we'll use the *seaborn* package rather than the native *pandas* plotting.

In [None]:
sns.barplot(x = "Dataset", y = "N. Articles", hue = "Year", data = counts)
plt.show()
print("Done!")
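
For comparison, here is a rough sketch of how a similar grouped bar chart might be drawn with native *pandas* plotting. The numbers in *toy_counts* are invented for illustration and are not results from the lab files. With *pandas*, we first reshape the table so that each year becomes its own column.

```python
import pandas as pd

# Invented counts in the same layout as the lab's table
toy_counts = pd.DataFrame(
    [["A", 2019, 3], ["A", 2020, 5], ["B", 2019, 2], ["B", 2020, 4]],
    columns=["Dataset", "Year", "N. Articles"],
)

# One row per dataset, one column per year
wide = toy_counts.pivot(index="Dataset", columns="Year", values="N. Articles")

# pandas draws one group of bars per row and one bar per column
ax = wide.plot(kind="bar")
ax.set_ylabel("N. Articles")
```

The *seaborn* call does this reshaping for us through the *hue* argument, which is one reason it is convenient for tables in this long format.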

And that's all for this lab! We've seen that we can iterate over files and categories in order to survey our data.