# 5-1: Pandas

No, not the bamboo-munching bears. [Pandas](https://pandas.pydata.org/) is an incredibly powerful data science library that allows deep statistical analysis on datasets through an easy-to-use API.

The simplest way I can explain it is this: imagine if Excel had a command line interface.

## Tabular Data

The reason I chose Excel as our metaphor is that Pandas deals with **tabular data**: data represented in rows and columns, like a spreadsheet. It comes packed with tools to parse, filter, summarize, and analyze these rows and columns, far beyond what is practical in a GUI.

To get started, we will need ourselves a dataset. I mentioned in the last lesson that Wappalyzer has a Python module. Let's use it to analyze some sites and collect the data!

In [14]:
# Import our stuff
# Pandas is conventionally named pd
import pandas as pd
from Wappalyzer import Wappalyzer, WebPage
wappalyzer = Wappalyzer.latest()

I'll use the list of sites from the last lesson, but feel free to modify it! The rest of this Notebook will still run; only the outputs will differ.

In [12]:
# Our sites to analyze
sites = [
    "taggart-tech.com",
    "shop.bbc.com",
    "huffpost.com",
    "discord.com",
    "soundcloud.com",
    "canva.com"
]

And now, we will iterate over the sites, using Wappalyzer to generate data for us. We'll save our results in `sites_data`.

In [13]:
sites_data = {}
for s in sites:
    # Create a WebPage object first, then analyze it
    webpage = WebPage.new_from_url(f"https://{s}")
    # Assign the result to the URL key
    sites_data[s] = wappalyzer.analyze_with_versions_and_categories(webpage)
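
Network requests can fail (a site may time out or refuse the connection), and a single exception would stop the whole loop. Here is a minimal sketch of one way to guard against that. Both `collect` and `fake_analyze` are hypothetical names made up for illustration; `fake_analyze` stands in for the two Wappalyzer calls so the example runs offline.

```python
def collect(sites, analyze_site):
    """Run analyze_site on each site, keeping failures separate from results."""
    results, errors = {}, {}
    for s in sites:
        try:
            results[s] = analyze_site(s)
        except Exception as exc:  # broad on purpose: any failure just skips that site
            errors[s] = str(exc)
    return results, errors


# Hypothetical stand-in for the WebPage / wappalyzer calls in the cell above
def fake_analyze(site):
    if site == "unreachable.example":
        raise ConnectionError("could not connect")
    return {"Nginx": {"versions": ["1.17.0"], "categories": ["Web servers"]}}


demo_data, demo_errors = collect(
    ["taggart-tech.com", "unreachable.example"], fake_analyze
)
print(list(demo_data))
print(demo_errors)
```

In the real Notebook, `analyze_site` would be a small function wrapping `WebPage.new_from_url` and `analyze_with_versions_and_categories`.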
    

Now let's understand the shape of our results. I'll review my site as an example.

In [16]:
sites_data["taggart-tech.com"]

{'Nginx': {'versions': ['1.17.0'],
  'categories': ['Web servers', 'Reverse proxies']},
 'Google Font API': {'versions': [], 'categories': ['Font scripts']}}

So it looks like every technology is its own key, with a dict containing `categories` and `versions`.
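
To double-check that shape, here is a quick sketch that walks the nested dict. The `sample` variable is hand-copied from the output above rather than fetched live.

```python
# Sample copied from the output for taggart-tech.com
sample = {
    "Nginx": {
        "versions": ["1.17.0"],
        "categories": ["Web servers", "Reverse proxies"],
    },
    "Google Font API": {"versions": [], "categories": ["Font scripts"]},
}

for tech, details in sample.items():
    # versions can be an empty list when no version was detected
    print(tech, details["categories"], details["versions"])

# Collect the set of keys used by each inner dict to confirm they all match
inner_keys = {frozenset(details) for details in sample.values()}
print(inner_keys)
```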

Unfortunately, this isn't a very nice structure for use with Pandas. If you imagine what our rows/columns would be for this dataset, you might immediately think that each site should be a row. But then, are the technologies columns? What do the cells contain then? The entire dict? That's no good; we need to pull this apart a bit more for usability.

**Designing your data structure** is a critical part of using Jupyter/Pandas, because how you structure data will determine how easily you can manipulate it later.

For this set, I might propose the following structure:

Technology | Site | Categories | Versions
--|--|--|--
Nginx | taggart-tech.com | 'Web servers, Reverse proxies' | '1.17.0'

So really, each technology is a row. We can use Pandas to isolate the tech from specific site(s) when necessary.

You might notice the quotes around the `Categories` and `Versions` values, and that the column names are plural. This is a choice I'm making based on usability. Do I need a separate row for each category a technology is in? Not really. And there will almost always be only one version of a technology present. So instead of making new rows by splitting these lists up, I'll be joining each list into a single string with `", ".join()`.

That's not always the right call, but it makes sense for the shape of this data.
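
For comparison, here is a sketch of the other option: one row per category. It assumes the categories are kept as lists rather than joined, and it uses `DataFrame.explode`. The `rows` data is copied from the example above, and `long_df` is just an illustrative name.

```python
import pandas as pd

rows = [
    {
        "technology": "Nginx",
        "site": "taggart-tech.com",
        "categories": ["Web servers", "Reverse proxies"],
    },
    {
        "technology": "Google Font API",
        "site": "taggart-tech.com",
        "categories": ["Font scripts"],
    },
]

# explode() repeats a row once for each element of the list in that column
long_df = pd.DataFrame(rows).explode("categories").reset_index(drop=True)
print(long_df)
```

Nginx now takes up two rows. That makes exact category matching simple, but counting technologies would need de-duplication first.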

Let's process `sites_data` into a brand new list of `dict`s called `dataframe_data`. A `DataFrame` is Pandas's top-level object, and it's what we're trying to create here. There are a lot of ways to create `DataFrame`s, but a common one is to simply convert a list of same-shape `dict`s. That's what we'll do here.

This has the added benefit of keeping the raw data around if we want it.

In [28]:
# Create empty list for new data
dataframe_data = []

# Loop through our data and create dicts based on our new structure
for s in sites_data:
    site = sites_data[s]
    for t in site:
        technology = site[t]
        restructured_data = {
            "technology": t,
            "site": s,
            "categories": ", ".join(technology["categories"]),
            "versions": ", ".join(technology["versions"])
        }
        dataframe_data.append(restructured_data)


And now, we simply call the `DataFrame` constructor with our list!

Conventionally, primary `DataFrame`s are named `df`, but they don't have to be.

In [29]:
# Create our new DataFrame
df = pd.DataFrame(dataframe_data)

In [30]:
# Look at the top of it with the built-in `.head()` method
df.head()

| | technology | site | categories | versions |
|--|--|--|--|--|
| 0 | Nginx | taggart-tech.com | Web servers, Reverse proxies | 1.17.0 |
| 1 | Google Font API | taggart-tech.com | Font scripts | |
| 2 | jQuery | shop.bbc.com | JavaScript libraries | 3.4.1 |
| 3 | Apple Pay | shop.bbc.com | Payment processors | |
| 4 | Venmo | shop.bbc.com | Payment processors | |
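
One thing worth noticing in that output: the blank `versions` cells are empty strings (the result of joining an empty list), not missing values. Here is a small sketch of how that shows up, using a hand-built two-row frame called `versions_demo` rather than the live data.

```python
import pandas as pd

versions_demo = pd.DataFrame(
    [
        {"technology": "Nginx", "versions": ", ".join(["1.17.0"])},
        {"technology": "Google Font API", "versions": ", ".join([])},
    ]
)

# Joining an empty list gives "", which isna() does not count as missing
missing_count = int(versions_demo["versions"].isna().sum())
print(missing_count)

# One option: find the blanks with an explicit comparison instead
no_version = versions_demo[versions_demo["versions"] == ""]
print(no_version["technology"].tolist())
```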


We won't go over the details of `DataFrame` manipulation just yet, but let's say we wanted to find all the Payment processors. We could try something like:

In [31]:
df.query("'Payment processors' in categories")

| | technology | site | categories | versions |
|--|--|--|--|--|
| 3 | Apple Pay | shop.bbc.com | Payment processors | |
| 4 | Venmo | shop.bbc.com | Payment processors | |
| 6 | American Express | shop.bbc.com | Payment processors | |
| 9 | Amazon Pay | shop.bbc.com | Payment processors | |
| 10 | Shop Pay | shop.bbc.com | Payment processors | |
| 11 | Google Pay | shop.bbc.com | Payment processors | |
| 12 | Visa | shop.bbc.com | Payment processors | |
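
Because `categories` holds a joined string, a test against the whole cell value would not find a technology by just one of its several categories (Nginx under `Web servers`, for instance). Here is a sketch of a substring match with `Series.str.contains`, using a small hand-built frame named `cats_demo` rather than the live `df`.

```python
import pandas as pd

cats_demo = pd.DataFrame(
    {
        "technology": ["Nginx", "Google Font API", "Apple Pay"],
        "categories": [
            "Web servers, Reverse proxies",
            "Font scripts",
            "Payment processors",
        ],
    }
)

# Exact comparison: the joined string is not equal to "Web servers"
exact = cats_demo[cats_demo["categories"] == "Web servers"]

# Substring match: regex=False treats the pattern as plain text
partial = cats_demo[cats_demo["categories"].str.contains("Web servers", regex=False)]
print(partial)
```

The trade-off is that a substring match can also catch more than intended if one category name happens to be contained in another.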


Or everything from a specific site?

In [32]:
df.query("site == 'taggart-tech.com'")

| | technology | site | categories | versions |
|--|--|--|--|--|
| 0 | Nginx | taggart-tech.com | Web servers, Reverse proxies | 1.17.0 |
| 1 | Google Font API | taggart-tech.com | Font scripts | |
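
The same kind of filter can also be written without a query string, using a boolean mask. Here is a sketch on a hand-built frame named `site_demo`, with rows copied from the outputs above.

```python
import pandas as pd

site_demo = pd.DataFrame(
    {
        "technology": ["Nginx", "Google Font API", "jQuery"],
        "site": ["taggart-tech.com", "taggart-tech.com", "shop.bbc.com"],
    }
)

# query() takes the condition as a string...
by_query = site_demo.query("site == 'taggart-tech.com'")

# ...while a boolean mask is built with ordinary Python operators
by_mask = site_demo[site_demo["site"] == "taggart-tech.com"]

print(by_query.equals(by_mask))
```

Which form to use is mostly a matter of taste: `query` reads closer to plain English, while masks compose naturally with the rest of your Python code.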


There is so much to explore in DataFrames, and we're going to dive in for real in the next lesson. This was meant as a quick intro to Pandas and to how we should think about shaping our data so it works well in tables.

See you in the next lesson where we'll get serious about Pandas!