<h1 style="text-align: center;">Data Software Names and Categories</h1>
<h3 style="text-align: center;">Includes a walkthrough and HTML to practice on</h3>

While many LLMs produce great results, product names can be a challenge.  Try these.

### Install the libraries into your environment using magic!

In [2]:
# An environment is basically a copy of Python that should be created for each project, typically in a folder called .venv or .conda
# This environment (copy) is where everything should be installed.  Environments are disposable, easy to make, and isolated,
# so no worries when trying pip install really_bad_library_that_breaks_stuff or seeing how your code works with different versions
# About magic: depending on the platform, characters like !, %, and %% (multi-line %) let you do a ton of things quickly and easily.
# Below, the pip command that would normally be entered into a terminal has the same effect when used with magic in a notebook

# Good practice to comment these out once they are installed for a consistent environment and easier execution
%pip install pandas
%pip install bs4
%pip install regex

Note: you may need to restart the kernel to use updated packages.
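If you are not sure whether the notebook kernel is actually running inside an environment, a quick check like the one below can help. This is only a sketch: it covers `venv`/`virtualenv` style environments. Conda environments are not detected this way; as far as I know they are usually recognised through the `CONDA_PREFIX` environment variable instead. The helper name `in_virtual_env` is my own.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a venv-style environment:", in_virtual_env())
```

If the interpreter path does not point into your project folder, the `%pip install` lines above are installing somewhere else than you think.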


### Import libraries so they can be used

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import re
import json
# These options adjust the viewing size of the pandas df
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


#### 400 pages were scraped from g2.com (using a separate Python script).  They really do not like sharing so it took a while.

In [4]:
## read json lines to pd dfs

# obj1 =pd.read_json(path_or_buf="g2_html/job-3409143-result.jsonl", lines=True)
# obj2 =pd.read_json(path_or_buf="g2_html/job-3409798-result.jsonl", lines=True)
# obj3 =pd.read_json(path_or_buf="g2_html/job-3411436-result.jsonl", lines=True)
# obj4 =pd.read_json(path_or_buf="g2_html/job-3427118-result.jsonl", lines=True)
# obj5 =pd.read_json(path_or_buf="g2_html/job-3462959-result.jsonl", lines=True)

# Original code commented out.  Use below for the included file
obj1 =pd.read_json(path_or_buf="g2_html.jsonl", lines=True)

# dfG = pd.concat([obj1,obj2,obj3,obj4,obj5], ignore_index= True)
#above concat not needed since we are only using one file.  Below line keeps naming correct.

dfG = obj1

# Scraping the website was a bit challenging (CloudFlare etc) and took several attempts,
# producing a lot of non-sequential results.  This function kept track of the html to avoid multiple downloads.
# It is defined here for reference and is not called in this notebook.
def get_page_numbers(dfG):
    """Extract page numbers that have been scraped using regex"""
    rx = re.compile(r'\b\d+\b')  # compile once, outside the loop
    pgNumList = []
    for i in range(0, len(dfG)):
        f = rx.findall(dfG["input"][i])
        pgNumList.append(int(f[0]))  # first whole number found in the url
    dfG["page_number"] = pgNumList
    dfG.sort_values("page_number", inplace=True, ignore_index=True)
    dfG.drop_duplicates(subset=["page_number"], inplace=True, ignore_index=True)
    
dfG.head()

Unnamed: 0,input,result
0,https://www.g2.com/search?order=popular&amp;pa...,"<!doctype html>\n<html class="" cors history js..."
1,https://www.g2.com/search?order=popular&amp;pa...,"<!doctype html>\n<html class="" cors history js..."
2,https://www.g2.com/search?order=popular&amp;pa...,"<!doctype html>\n<html class="" cors history js..."
3,https://www.g2.com/search?order=popular&amp;pa...,"<!doctype html>\n<html class="" cors history js..."
4,https://www.g2.com/search?order=popular&amp;pa...,"<!doctype html>\n<html class="" cors history js..."
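To see what the `\b\d+\b` pattern in `get_page_numbers` picks up, here is a small stand-alone sketch. The URLs are hypothetical (the real query string is cut off in the output above, so the `page=` parameter and the `g2.example` host are assumptions), and `first_number` is a helper name I made up.

```python
import re

# Hypothetical URLs for illustration only; the "page=" parameter is an assumption.
urls = [
    "https://www.g2.example/search?order=popular&page=12",
    "https://www.g2.example/search?order=popular&page=3",
    "https://www.g2.example/search?order=popular&page=12",  # scraped twice
]

number_rx = re.compile(r"\b\d+\b")  # same pattern as get_page_numbers


def first_number(url):
    """Return the first whole number in url, or None when there is none."""
    match = number_rx.search(url)
    return int(match.group()) if match else None


# A set drops the repeated page, sorted() puts the pages back in order.
pages = sorted({first_number(u) for u in urls})
print(pages)  # [3, 12]
```

The word boundaries matter here: the "2" in "g2" sits right after a letter, so it is not a whole number on its own and the pattern skips it.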


#### Begin building a scraper.  After looking at the HTML, it looks like all of the data I want is in the \<a> tags

In [5]:
## collect the <a> tags from every scraped page into one list (one entry per page)

def a_lister_list(dfG):
    """Uses Beautiful Soup to extract <a> tags into a list from each page"""
    aList = []
    for i in range(0,len(dfG)):
        reOne = dfG["result"][i] 
        aSoup = BeautifulSoup(str(reOne), "html.parser")
        aHunt = aSoup.find_all("a")
        aList.append(aHunt)
    return aList
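If you want to see the idea behind `find_all("a")` without installing anything, the sketch below does a similar job with the standard library's `html.parser` instead of Beautiful Soup. It is a swapped-in technique, not what the notebook uses, and the HTML snippet and the `AnchorCollector` class are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self.anchors.append(dict(attrs))


# Made-up page fragment
html = '<div><a href="/x" data-info="1">X</a><p>no link here</p><a href="/y">Y</a></div>'

collector = AnchorCollector()
collector.feed(html)
print(collector.anchors)  # [{'href': '/x', 'data-info': '1'}, {'href': '/y'}]
```

Beautiful Soup is still the nicer tool for real pages since it gives back full tag objects, including their text and children.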



#### Within the \<a> tags, the information I want is stored as json. After pulling out the json using regex, I convert json to python dictionaries which makes them easy to work with

In [8]:
# Rather than write one long function, make shorter functions that are then combined.
# Makes it much easier to read, troubleshoot, and change

def json_extract(input_text):
    "extract json from the html of a single a tag aka a_team_solo"
    pattern = re.compile(r"\{.*\}")
    return pattern.findall(repr(input_text))

# a_team_solo = a single element from the list of extracted a tag html
def a_dicts (a_team_solo):
    """Uses json_extract to get the json and catch errors.  Then convert to python  """
    j_list = []
    for i in range(0,len(a_team_solo)):
        try:
            js = json.loads(json_extract(str(a_team_solo[i]))[0]) 
            j_list.append(js)
        except IndexError:
            pass
        except KeyError:
            pass
        except json.JSONDecodeError as e:
            # I use a lot of print statements during development and testing.  
            # They output without needing a return statement and are easy to put in / take out
            # They help show where a script is failing and/or what is happening
            # For more in depth projects, a logging library is a better choice
            print(f"{i} Stupid json.  I hope this isn't a problem: {e}")
    return j_list


def cats(a_team_solo):
    """Uses a_dicts to compile category data into a pandas dataframe """
    dList = a_dicts(a_team_solo)
    catList = []
    for i in range(0,len(dList)):
        try:
            cat = dList[i]["category"]
            catList.append(cat)
        except KeyError:
            pass
    catDF = pd.DataFrame({"category" : catList})
    return catDF


def sfts(a_team_solo):
    """Uses a_dicts to compile software and associated category data into a pandas dataframe """
    dList = a_dicts(a_team_solo)
    sftList = []
    catList = []
    for i in range(0,len(dList)):
        try:
            sft = dList[i]["product"]
            cat = dList[i]["category"]
            catList.append(cat)
            sftList.append(sft)
        except KeyError:
            pass
    sftDF = pd.DataFrame({"software" : sftList, "category" : catList})
    return sftDF
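Here is a tiny end-to-end sketch of the extract-then-parse step on a single tag. The tag, its attribute name, and the product "Acme CRM" are all made up; only the `"product"` and `"category"` keys come from the functions above. Unlike `json_extract`, this version searches the plain string rather than its `repr`, which is an assumption on my part about what is simplest to show.

```python
import json
import re

# Hypothetical <a> tag; the attribute name and payload are invented for this example.
tag_html = (
    '<a href="/products/acme" '
    'data-event-options=\'{"product": "Acme CRM", "category": "CRM Software"}\'>'
    "Acme CRM</a>"
)

json_rx = re.compile(r"\{.*\}")  # greedy: first "{" through the last "}"


def first_json_object(text):
    """Return the first {...} span in text parsed as a dict, or None."""
    matches = json_rx.findall(text)
    if not matches:
        return None
    try:
        return json.loads(matches[0])
    except json.JSONDecodeError:
        return None


payload = first_json_object(tag_html)
print(payload["product"], "|", payload["category"])  # Acme CRM | CRM Software
print(first_json_object('<a href="/about">About</a>'))  # None
```

Tags without any braces are the `IndexError` case in `a_dicts`; tags whose JSON lacks a `"product"` key are the `KeyError` case in `sfts`.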

#### Put it all together and start cooking. We end up with a nice pot of food that looks and tastes delicious.

In [15]:
aTeam = a_lister_list(dfG)

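def getCategoriesDF(aTeam):
    # Note: catList is reassigned on every pass through the loop,
    # so the DataFrame returned holds the last page only
    for i in range(0, len(aTeam)):
        aSolo = aTeam[i]
        catList = cats(aSolo)
    return catList

getCategoriesDF(aTeam)       

def getSoftwareDF(aTeam):
    # Same pattern as above: sftList ends up holding the last page only
    for i in range(0, len(aTeam)):
        aSolo = aTeam[i]
        sftList = sfts(aSolo)
    return sftList

df_s = getSoftwareDF(aTeam)
df_c = getCategoriesDF(aTeam)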

# A few errors indicated below to check on.  Spoiler Alert!  Around 80 errors is low 
# and they do not involve data we want
    

404 Stupid json.  I hope this isn't a problem: Extra data: line 1 column 189 (char 188)
464 Stupid json.  I hope this isn't a problem: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
465 Stupid json.  I hope this isn't a problem: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
466 Stupid json.  I hope this isn't a problem: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
467 Stupid json.  I hope this isn't a problem: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
468 Stupid json.  I hope this isn't a problem: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
469 Stupid json.  I hope this isn't a problem: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
470 Stupid json.  I hope this isn't a problem: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
471 Stupid json.  I hope this isn't a problem: Expe
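The two messages in the output above are standard `json.JSONDecodeError` messages. The actual payloads that caused them are not shown, so the strings below are made-up stand-ins that trigger the same two messages; treat this as a sketch of the likely causes, not a diagnosis.

```python
import json

# Made-up inputs that reproduce the two error messages seen above.
samples = {
    "more than one object in the match": '{"a": 1}, {"b": 2}',
    "keys in single quotes": "{'a': 1}",
}

messages = {}
for label, text in samples.items():
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        messages[label] = err.msg
        print(f"{label}: {err.msg} (char {err.pos})")
```

The first case fits a greedy pattern like `\{.*\}`, which runs from the first `{` to the last `}` and can swallow more than one object. The second fits text that looks like JSON but is really a JavaScript or Python style literal.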

#### Remove duplicates and check everything over

In [16]:
def df_check (df):
    """Clean up duplicates, make sure everything is in order, see what we have"""
    # ignore_index=True already renumbers the rows 0..n-1, so no separate reindex is needed
    df = df.drop_duplicates(ignore_index=True)
    print(df.describe())
    return df

dfC = df_check(df_c)

dfS = df_check(df_s)


            category
count            336
unique           336
top     CRM Software
freq               1
        software                                category
count         51                                      51
unique        20                                      51
top     Datavail  Police Records Management System (RMS)
freq          22                                       1


#### Format df so that the Software DF shows a list of applicable categories for each row

In [18]:
dfSGrouped = dfS.groupby('software')['category'].apply(list).reset_index(name='categories')
# check to see if there are unused categories.  There are, so categories will be kept as a separate file.
len(set(dfC['category']) - set(dfS['category']))

285
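If the `groupby(...).apply(list)` line and the set subtraction feel like magic, this dependency-free sketch does the same two things with plain Python. The product names and the "Payroll" category are made up for illustration.

```python
from collections import defaultdict

# Made-up (software, category) pairs and a made-up master list of categories.
pairs = [
    ("Alpha", "CRM Software"),
    ("Alpha", "Help Desk"),
    ("Beta", "CRM Software"),
]
all_categories = {"CRM Software", "Help Desk", "Payroll"}

# Group: one list of categories per software name (what groupby + apply(list) does).
grouped = defaultdict(list)
for software, category in pairs:
    grouped[software].append(category)

# Set difference: categories that no software row mentions.
unused = all_categories - {category for _, category in pairs}

print(dict(grouped))
print(sorted(unused))  # ['Payroll']
```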

#### Save as csv and share with others

In [None]:
# to_csv returns None when it is given a file path, so there is nothing useful to assign
dfSGrouped.to_csv("software_names.csv")
dfC.to_csv("software_categories.csv")

# The g2_html shared here is great for practice but only yields a small subset.
# categoryWords.csv and softwareWords.csv include the results from all of the html.
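One thing to watch when sharing the software file: a CSV cell can only hold text, so a column of Python lists is, as far as I know, written out as the string form of each list. Whoever reads the file back has to turn that text into a list again. The sketch below shows the round trip with the standard `csv` module and `ast.literal_eval` on made-up rows; it is an illustration of the gotcha, not the notebook's own code.

```python
import ast
import csv
import io

# Made-up rows shaped like dfSGrouped: a name plus a list of categories.
rows = [
    ("Alpha", ["CRM Software", "Help Desk"]),
    ("Beta", ["CRM Software"]),
]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["software", "categories"])
for name, categories in rows:
    writer.writerow([name, categories])  # the list is stored as its str() form

# Read it back: each categories cell is now a string like "['CRM Software']".
buffer.seek(0)
restored = []
for record in csv.DictReader(buffer):
    # literal_eval safely parses the text of a Python literal back into a list
    restored.append((record["software"], ast.literal_eval(record["categories"])))

print(restored == rows)  # True
```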
 

#### One more thing!

In [20]:
# A bit of magic for a list of all the libraries used.  This would normally go in a requirements.txt
# file and represent the versions used and tested.  Although we only installed 3 packages, each of those
# pulls in packages of its own, giving us the list below
%pip freeze

appnope==0.1.4
asttokens==2.4.1
beautifulsoup4==4.12.3
bs4==0.0.2
comm==0.2.1
debugpy==1.8.1
decorator==5.1.1
executing==2.0.1
ipykernel==6.29.2
ipython==8.21.0
jedi==0.19.1
jupyter_client==8.6.0
jupyter_core==5.7.1
matplotlib-inline==0.1.6
nest-asyncio==1.6.0
numpy==1.26.4
packaging==23.2
pandas==2.2.0
parso==0.8.3
pexpect==4.9.0
platformdirs==4.2.0
prompt-toolkit==3.0.43
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
Pygments==2.17.2
python-dateutil==2.8.2
pytz==2024.1
pyzmq==25.1.2
regex==2023.12.25
six==1.16.0
soupsieve==2.5
stack-data==0.6.3
tornado==6.4
traitlets==5.14.1
tzdata==2024.1
wcwidth==0.2.13
Note: you may need to restart the kernel to use updated packages.
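If someone hands you a list like this, you may want to compare it with what is installed in your own environment. Below is a rough sketch using the standard library's `importlib.metadata`. The helper names `parse_pins` and `installed_version` are my own, and the sample lines are copied from the output above.

```python
from importlib import metadata


def parse_pins(lines):
    """Turn 'name==version' lines into a {name: version} dict, skipping anything else."""
    pins = {}
    for line in lines:
        line = line.strip()
        if line.startswith("#") or "==" not in line:
            continue  # comments, blank lines, and notes from pip
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


sample = [
    "pandas==2.2.0",
    "beautifulsoup4==4.12.3",
    "Note: you may need to restart the kernel to use updated packages.",
]

pins = parse_pins(sample)
for name, wanted in pins.items():
    print(f"{name}: pinned {wanted}, installed {installed_version(name)}")
```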
