# EDA of Infogroup Datasets
#### This notebook reads in the raw Infogroup datasets, cleans them, and outputs a master dataframe covering the years 1997-2022. A brief analysis follows.

In [13]:
import pandas as pd
import numpy as np
from pipeline.clean import clean_infogroup
from pipeline.constants import RAW_INFOGROUP_FPATH, CLEANED_INFOGROUP_FPATH, \
                                INFOGROUP_2022

ModuleNotFoundError: No module named 'pipeline'

In [3]:
info_2022 = pd.read_csv(INFOGROUP_2022)
info_2022.info()

This dataframe has 720 rows
The column data types of the variables in this DF are
 Unnamed: 0                        int64
COMPANY                          object
ADDRESS LINE 1                   object
CITY                             object
STATE                            object
ZIPCODE                         float64
ZIP4                            float64
COUNTY CODE                     float64
AREA CODE                         int64
IDCODE                            int64
LOCATION EMPLOYEE SIZE CODE      object
LOCATION SALES VOLUME CODE       object
PRIMARY SIC CODE                  int64
SIC6_DESCRIPTIONS                object
PRIMARY NAICS CODE              float64
NAICS8 DESCRIPTIONS              object
SIC CODE                        float64
SIC6_DESCRIPTIONS (SIC)          object
SIC CODE 1                      float64
SIC6_DESCRIPTIONS (SIC1)         object
SIC CODE 2                      float64
SIC6_DESCRIPTIONS(SIC2)          object
SIC CODE 3                      float

## To clean the Infogroup dataset, we:
#### Combined each individual year's dataset, from 1997 to 2022, into one large dataframe; standardized column names to all uppercase; replaced NaNs with 0s; converted float64 categorical variables (such as YEAR ESTABLISHED) to int64; and added a column of parent names using the parent company ABIs.
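
The cleaning steps described above can be sketched roughly as follows. This is not the actual `clean_infogroup` implementation, which is not shown in this notebook. The function name `clean_yearly_frames`, the toy input frames, the choice to fill NaNs only in numeric columns, and the lookup of parent names by matching `PARENT NUMBER` against `ABI` are all assumptions made for illustration.

```python
import pandas as pd


def clean_yearly_frames(frames):
    """Hypothetical helper: combine per-year frames into one cleaned frame."""
    standardized = []
    for frame in frames:
        frame = frame.copy()
        # Standardize column names to all uppercase before combining.
        frame.columns = [str(col).upper() for col in frame.columns]
        standardized.append(frame)
    master = pd.concat(standardized, ignore_index=True)

    # Replace NaNs with 0 (assumption: numeric columns only).
    numeric_cols = master.select_dtypes(include="number").columns
    master[numeric_cols] = master[numeric_cols].fillna(0)

    # Cast float-coded categorical columns to int64.
    for col in ("YEAR ESTABLISHED", "ABI", "PARENT NUMBER"):
        if col in master.columns:
            master[col] = master[col].astype("int64")

    # Assumption: a parent's name is the COMPANY of the row whose ABI
    # equals PARENT NUMBER; unmatched parents are left as NaN.
    abi_to_name = master.drop_duplicates(subset="ABI").set_index("ABI")["COMPANY"]
    master["PARENT NAME"] = master["PARENT NUMBER"].map(abi_to_name)
    return master


# Toy data, invented for this example.
raw_1997 = pd.DataFrame({
    "Company": ["PARENT CO", "PLANT A"],
    "abi": [100, 200],
    "Year Established": [1950.0, float("nan")],
    "Parent Number": [float("nan"), 100.0],
    "Archive Version Year": [1997, 1997],
})
raw_1998 = pd.DataFrame({
    "COMPANY": ["PLANT A"],
    "ABI": [200],
    "YEAR ESTABLISHED": [float("nan")],
    "PARENT NUMBER": [100.0],
    "ARCHIVE VERSION YEAR": [1998],
})

master = clean_yearly_frames([raw_1997, raw_1998])
print(master)
```

Uppercasing the column names before concatenating matters here: otherwise `abi` and `ABI` would end up as two separate, half-empty columns in the combined frame.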

In [3]:
clean_infogroup(RAW_INFOGROUP_FPATH, '2015', filter)
df = pd.read_csv(CLEANED_INFOGROUP_FPATH, index_col=0)
df.head(5)

| Unnamed: 0 | COMPANY | ADDRESS LINE 1 | CITY | STATE | ZIPCODE | PRIMARY SIC CODE | ARCHIVE VERSION YEAR | YEAR ESTABLISHED | ABI | SALES VOLUME (9) - LOCATION | COMPANY HOLDING STATUS | PARENT NUMBER | PARENT NAME | LATITUDE | LONGITUDE | YEAR 1ST APPEARED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TYSON FOODS INC | 5421 W BEAVER ST | JACKSONVILLE | FL | 32254.0 | 201501.0 | 1997 | 0 | 456474451 | 116100.0 | 0.0 | 7537913 | Tyson Foods Inc | 30.326801 | -81.741302 | 0 |
| 1 | PECO FOODS INC | 3701 KAULOOSA AVE | TUSCALOOSA | AL | 35401.0 | 201501.0 | 1997 | 0 | 166231 | 140000.0 | 0.0 | 0 |  | 33.178114 | -87.56144 | 0 |
| 2 | GOLD KIST INC | 1001 HIGHWAY 78 BYP W | JASPER | AL | 35501.0 | 201501.0 | 1997 | 0 | 133681577 | 2085.0 | 0.0 | 1941509 | UNKNOWN | 33.848245 | -87.289162 | 0 |
| 3 | KING FOODS | 641 HOLLY ST NE | DECATUR | AL | 35601.0 | 201501.0 | 1997 | 0 | 886923978 | 7740.0 | 0.0 | 0 |  | 34.604495 | -86.978356 | 0 |
| 4 | WAYNE FARMS | IPSCO ST | DECATUR | AL | 35601.0 | 201501.0 | 1997 | 0 | 635771 | 129000.0 | 0.0 | 0 |  | 34.593707 | -86.996174 | 0 |


## Brief Analysis & Exploration

In [4]:
unique_businesses = df.drop_duplicates(subset=['ABI'])
print("There are", len(unique_businesses), "businesses with unique ABI codes within this dataset")

There are 2166 businesses with unique ABI codes within this dataset


In [5]:
# Which states have the most unique processing plants?
unique_businesses.groupby('STATE').count()['COMPANY'].sort_values(ascending=False)

STATE
AR    208
GA    176
TX    169
AL    136
NC    126
CA    113
MS     96
MN     89
MO     72
OH     70
PA     64
IA     54
VA     51
NY     51
OK     49
IN     46
SC     41
TN     40
IL     39
WI     38
FL     35
NE     35
MI     34
MD     33
LA     32
KY     31
CO     26
DE     25
NJ     24
KS     19
SD     19
MA     15
WA     14
ME     13
UT     12
OR     11
HI     10
NH      8
ID      8
WV      6
RI      5
CT      5
AZ      5
ND      4
DC      2
VT      2
AK      2
NV      1
NM      1
MT      1
Name: COMPANY, dtype: int64

### Using ABI to track businesses across the years

##### The ABI is the unique number assigned to each business in the Infogroup database, and a business's ABI does not vary by year. Because of this, we can use the ABI to perform record linkage on the Infogroup data from 1997 to 2022.
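
As a minimal sketch of what ABI-based linkage could look like, the example below groups records by `ABI` and summarizes the years in which each business appears. The toy `records` frame and its ABI values are invented for illustration. The column names `ABI` and `ARCHIVE VERSION YEAR` come from the cleaned dataframe shown earlier, while `first_year`, `last_year`, `years_present`, and `has_gap` are hypothetical names.

```python
import pandas as pd

# Toy records, invented for this example: one row per business per year.
records = pd.DataFrame({
    "ABI": [111, 111, 111, 222, 333, 333],
    "COMPANY": ["PLANT A", "PLANT A", "PLANT A INC", "PLANT B", "PLANT C", "PLANT C"],
    "ARCHIVE VERSION YEAR": [1997, 1998, 2000, 1997, 2021, 2022],
})

# Link rows that share an ABI and summarize when each business appears.
history = (
    records.groupby("ABI")["ARCHIVE VERSION YEAR"]
    .agg(first_year="min", last_year="max", years_present="nunique")
    .reset_index()
)

# Flag businesses that are missing from at least one year between their
# first and last appearance.
span = history["last_year"] - history["first_year"] + 1
history["has_gap"] = history["years_present"] < span

print(history)
```

Because the linkage key is the ABI rather than the company name, the renamed `PLANT A INC` row is still treated as the same business as `PLANT A`.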