## Introduction

This notebook supports the SQL analysis project carried out in MySQL. It does the following:

* Loads the PDP database data from its native .txt and .xlsx formats and transforms it into a format that can be imported into MySQL.

* Adds the column headers retrieved from the PDF data dictionary file.

* Converts the normalisation reference tables from Excel to CSV for loading into MySQL.

* Creates control code to confirm that the queries written in SQL return correct results.

* Saves the queries in a DataFrame and stores them in a table in the database.
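
The control step in the list above could look like the following minimal sketch. It is not the project's code: an in-memory SQLite database stands in for MySQL so the example runs anywhere, and the table contents and the helper name `check_query` are hypothetical.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the samples table (values are illustrative only).
samples = pd.DataFrame({
    "SAMPLE_PK": [1, 2, 3, 4],
    "STATE": ["CA", "CA", "WA", "WA"],
})

# SQLite in memory stands in for the MySQL database (assumption).
con = sqlite3.connect(":memory:")
samples.to_sql("pdp_samples", con, index=False)


def check_query(sql, expected):
    """Run a single-value SQL query and compare it with a pandas result."""
    sql_value = pd.read_sql(sql, con).iloc[0, 0]
    return int(sql_value) == int(expected)


# SQL count versus the pandas equivalent on the same data.
counts_match = check_query(
    "SELECT COUNT(*) AS number_of_samples FROM pdp_samples", len(samples)
)
print(counts_match)
```

The same helper can compare any single-value aggregate, for example a `COUNT(DISTINCT ...)` against `Series.nunique()`.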

In [1]:
# Import the pandas library
import pandas as pd

In [2]:
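# Load the pipe-delimited samples file; the numeric names are placeholders that are replaced later
samples_df = pd.read_csv("data/PDP21Samples.txt", sep="|", names=[0, 1, 2, 3, 4, 5, 6, 7, 8,
                                                           9, 10, 11, 12, 13, 14, 15, 17, 18], index_col=False)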

# samples_df.drop(0, axis=1,inplace =True) # to ensure the index starts at zero
samples_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18
0,1,CA,21,3,8,43,BB,,,2,595,T,FR,NC,,,,IL
1,2,CA,21,3,8,97,BB,,Na,2,275,T,FR,NC,,,,
2,3,CA,21,3,8,149,BB,,Na,1,,D,FR,PO,,CA,,CA
3,4,CA,21,3,8,230,BB,,,1,,T,FR,PO,,,,CA
4,5,CA,21,3,8,268,BB,,,2,275,T,FR,NC,,,,CA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,10123,WA,21,11,8,38,WS,,Butternut,1,,D,FR,NC,2.0,,,OR
10123,10124,WA,21,12,13,2,WS,,Acorn,1,,D,FR,NC,3.0,,,WA
10124,10125,WA,21,12,13,10,WS,,Butternut,1,,D,FR,NC,2.0,,,WA
10125,10126,WA,21,12,13,31,WS,,Butternut,2,595,D,FR,NC,2.0,,,CA


In [3]:
# Load the results.txt file into the notebook

results_df = pd.read_csv("data/PDP21Results.txt", sep="|", names=[1, 2, 3, 4, 5, 6, 7, 8,
                                                           9, 10, 11, 12, 13, 14, 15, 16], index_col=False)
results_df


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,1,BB,FR,WA1,001,A,,0.0030,M,,,,,ND,805,35
1,1,BB,FR,WA1,024,C,,0.0050,M,,,,,ND,805,64
2,1,BB,FR,WA1,028,A,,0.0100,M,,,,,ND,805,35
3,1,BB,FR,WA1,032,A,,0.0015,M,,,,,ND,805,64
4,1,BB,FR,WA1,034,A,,0.0100,M,,,,,ND,805,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2737928,10127,WS,FR,CA1,F56,A,,0.0013,M,,,,,ND,805,52
2737929,10127,WS,FR,CA1,F89,A,,0.0025,M,,,,,ND,805,52
2737930,10127,WS,FR,CA1,F94,A,,0.0200,M,,,,,ND,805,52
2737931,10127,WS,FR,CA1,G00,A,,0.0013,M,,,,,ND,805,52


In [4]:
# Create a list for the samples table column headers retrieved from the PDF file
samples_header = ['SAMPLE_PK','STATE', 'YEAR', 'MONTH', 'DAY', 'SITE', 'COMMOD',
                 'SOURCE_ID', 'VARIETY', 'ORIGIN', 'COUNTRY', 'DISTTYPE', 'COMMTYPE',
                 'CLAIM', 'QUANTITY', 'GROWST', 'PACKST', 'DISTST']

In [5]:
# Create a list for the results table column headers retrieved from the PDF file
results_header = ['SAMPLE_PK','COMMOD', 'COMMTYPE', 'LAB', 'PESTCODE', 'TESTCLASS', 'CONCEN', 'LOD', 'CONUNIT', 'CONFMETHOD', 'CONFMETHOD2',
                 'ANNOTATE', 'QUANTITATE', 'MEAN', 'EXTRACT', 'DETERMIN']

In [6]:
results_df.columns = results_header

In [7]:
results_df

Unnamed: 0,SAMPLE_PK,COMMOD,COMMTYPE,LAB,PESTCODE,TESTCLASS,CONCEN,LOD,CONUNIT,CONFMETHOD,CONFMETHOD2,ANNOTATE,QUANTITATE,MEAN,EXTRACT,DETERMIN
0,1,BB,FR,WA1,001,A,,0.0030,M,,,,,ND,805,35
1,1,BB,FR,WA1,024,C,,0.0050,M,,,,,ND,805,64
2,1,BB,FR,WA1,028,A,,0.0100,M,,,,,ND,805,35
3,1,BB,FR,WA1,032,A,,0.0015,M,,,,,ND,805,64
4,1,BB,FR,WA1,034,A,,0.0100,M,,,,,ND,805,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2737928,10127,WS,FR,CA1,F56,A,,0.0013,M,,,,,ND,805,52
2737929,10127,WS,FR,CA1,F89,A,,0.0025,M,,,,,ND,805,52
2737930,10127,WS,FR,CA1,F94,A,,0.0200,M,,,,,ND,805,52
2737931,10127,WS,FR,CA1,G00,A,,0.0013,M,,,,,ND,805,52


In [8]:
samples_df.columns = samples_header

In [9]:
samples_df

Unnamed: 0,SAMPLE_PK,STATE,YEAR,MONTH,DAY,SITE,COMMOD,SOURCE_ID,VARIETY,ORIGIN,COUNTRY,DISTTYPE,COMMTYPE,CLAIM,QUANTITY,GROWST,PACKST,DISTST
0,1,CA,21,3,8,43,BB,,,2,595,T,FR,NC,,,,IL
1,2,CA,21,3,8,97,BB,,Na,2,275,T,FR,NC,,,,
2,3,CA,21,3,8,149,BB,,Na,1,,D,FR,PO,,CA,,CA
3,4,CA,21,3,8,230,BB,,,1,,T,FR,PO,,,,CA
4,5,CA,21,3,8,268,BB,,,2,275,T,FR,NC,,,,CA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,10123,WA,21,11,8,38,WS,,Butternut,1,,D,FR,NC,2.0,,,OR
10123,10124,WA,21,12,13,2,WS,,Acorn,1,,D,FR,NC,3.0,,,WA
10124,10125,WA,21,12,13,10,WS,,Butternut,1,,D,FR,NC,2.0,,,WA
10125,10126,WA,21,12,13,31,WS,,Butternut,2,595,D,FR,NC,2.0,,,CA


In [10]:
# Save the files to .csv for use in SQL
samples_df.to_csv('data/pdp_samples.csv', index=False)
results_df.to_csv('data/pdp_results.csv', index=False)
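
The header lists can also be passed straight to `read_csv`, and reading code columns as text keeps leading zeros such as `001` in `PESTCODE`, which may otherwise be read as numbers. A minimal sketch, assuming a two-row, pipe-delimited sample held in memory (the rows and the shortened header are illustrative, not the real file):

```python
import io

import pandas as pd

# Two illustrative pipe-delimited rows standing in for the results file.
raw = io.StringIO("1|BB|001|0.0030\n1|BB|024|0.0050\n")

# Shortened, hypothetical header; the real list has 16 names.
header = ["SAMPLE_PK", "COMMOD", "PESTCODE", "LOD"]

df = pd.read_csv(
    raw,
    sep="|",
    names=header,             # assign column names while reading
    dtype={"PESTCODE": str},  # keep codes as text so "001" is not read as 1
)
print(df["PESTCODE"].tolist())
```

Whether the codes should stay as text depends on how the matching reference table stores its keys.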

**Reference Tables**

We will use Pandas to prepare the reference tables from the .xlsx files created from the PDF in the PDP data repository. These tables will then be imported into MySQL.

In [13]:
# Import the annotate code reference table (the first two rows of the sheet are skipped)
annotate_df = pd.read_excel("data/Ref Tables/Annotate Code_PDP ReferenceTables 2021-17.xlsx", skiprows=2)
annotate_df

Unnamed: 0,Annotate Code,Annotated Information
0,Q,Residue at below quantifiable level (BQL)
1,QV,Residue at <BQL> with presumptive violation - ...
2,V,Residue with a presumptive violation - No Tole...
3,X,Residue with a presumptive violation - Exceeds...


In [15]:
# Write the DataFrame to csv for upload into MySQL
annotate_df.to_csv('data/Annotate Code.csv', index=False)

In [47]:
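# Create a helper function to load an Excel file and save it as a CSV next to the original
def load_to_csv(file):
    df = pd.read_excel(file, skiprows=2)
    # Drop only the final extension so other dots in the path are preserved
    filename = file.rsplit(".", 1)[0]
    df.to_csv(filename + '.csv', index=False)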


In [50]:
load_to_csv("data/Ref Tables/Annotate Code_PDP ReferenceTables 2021-17.xlsx")
load_to_csv("data/Ref Tables/Commodity Code_PDP ReferenceTables 2021-15.xlsx")
load_to_csv("data/Ref Tables/Commodity Type Code_PDP ReferenceTables 2021-14.xlsx")
load_to_csv("data/Ref Tables/Concentration-LOD Unit Code_PDP ReferenceTables 2021-13.xlsx")
load_to_csv("data/Ref Tables/Confirmation Method Code_PDP ReferenceTables 2021-12.xlsx")
load_to_csv("data/Ref Tables/Country Code_PDP ReferenceTables 2021-11.xlsx")
load_to_csv("data/Ref Tables/Determinative Method Code_PDP ReferenceTables 2021-10.xlsx")
load_to_csv("data/Ref Tables/Distribution Type Code_PDP ReferenceTables 2021-9.xlsx")
load_to_csv("data/Ref Tables/Extract Code_PDP ReferenceTables 2021-8.xlsx")
load_to_csv("data/Ref Tables/Lab Code_PDP ReferenceTables 2021-7.xlsx")
load_to_csv("data/Ref Tables/Marketing Claim Code_PDP ReferenceTables 2021-16.xlsx")
load_to_csv("data/Ref Tables/Mean Code_PDP ReferenceTables 2021-6.xlsx")
load_to_csv("data/Ref Tables/Origin Code_PDP ReferenceTables 2021-5.xlsx")

In [51]:
load_to_csv("data/Ref Tables/Pest Code_PDP ReferenceTables 2021-4.xlsx")
load_to_csv("data/Ref Tables/Quantitation Code_PDP ReferenceTables 2021-3.xlsx")
load_to_csv("data/Ref Tables/State Code_PDP ReferenceTables 2021-2.xlsx")
load_to_csv("data/Ref Tables/Test Class Code_PDP ReferenceTables 2021.xlsx")
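
The repeated calls above could be replaced by a loop over the folder. A minimal sketch, assuming all reference workbooks sit in one folder and share the same two-row preamble; `csv_target` and `convert_folder` are hypothetical names, and `convert_folder` needs pandas plus an Excel engine such as openpyxl.

```python
from pathlib import Path


def csv_target(xlsx_file):
    """Return the .csv path that sits next to the given Excel file."""
    return Path(xlsx_file).with_suffix(".csv")


def convert_folder(folder, skiprows=2):
    """Convert every .xlsx file in `folder` to .csv and return the new paths."""
    import pandas as pd  # imported here so csv_target works without pandas

    written = []
    for xlsx_file in sorted(Path(folder).glob("*.xlsx")):
        df = pd.read_excel(xlsx_file, skiprows=skiprows)
        target = csv_target(xlsx_file)
        df.to_csv(target, index=False)
        written.append(target)
    return written


print(csv_target("data/Ref Tables/Lab Code_PDP ReferenceTables 2021-7.xlsx"))
```

Calling `convert_folder("data/Ref Tables")` would then cover every workbook in one step, and `with_suffix` replaces only the final extension.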

In [38]:
# Load the results data for further cleaning due to error while trying to load to MySQL
results = pd.read_csv("C:\\ProgramData\\MySQL\\MySQL Server 8.0\\Uploads\\pdp_results.csv")
results


Unnamed: 0,SAMPLE_PK,COMMOD,COMMTYPE,LAB,PESTCODE,TESTCLASS,CONCEN,LOD,CONUNIT,CONFMETHOD,CONFMETHOD2,ANNOTATE,QUANTITATE,MEAN,EXTRACT,DETERMIN
0,1,BB,FR,WA1,001,A,,0.0030,M,,,,,ND,805,35
1,1,BB,FR,WA1,024,C,,0.0050,M,,,,,ND,805,64
2,1,BB,FR,WA1,028,A,,0.0100,M,,,,,ND,805,35
3,1,BB,FR,WA1,032,A,,0.0015,M,,,,,ND,805,64
4,1,BB,FR,WA1,034,A,,0.0100,M,,,,,ND,805,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2737928,10127,WS,FR,CA1,F56,A,,0.0013,M,,,,,ND,805,52
2737929,10127,WS,FR,CA1,F89,A,,0.0025,M,,,,,ND,805,52
2737930,10127,WS,FR,CA1,F94,A,,0.0200,M,,,,,ND,805,52
2737931,10127,WS,FR,CA1,G00,A,,0.0013,M,,,,,ND,805,52


In [4]:
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2737933 entries, 0 to 2737932
Data columns (total 16 columns):
 #   Column       Dtype  
---  ------       -----  
 0   SAMPLE_PK    int64  
 1   COMMOD       object 
 2   COMMTYPE     object 
 3   LAB          object 
 4   PESTCODE     object 
 5   TESTCLASS    object 
 6   CONCEN       float64
 7   LOD          float64
 8   CONUNIT      object 
 9   CONFMETHOD   object 
 10  CONFMETHOD2  float64
 11  ANNOTATE     object 
 12  QUANTITATE   object 
 13  MEAN         object 
 14  EXTRACT      object 
 15  DETERMIN     int64  
dtypes: float64(3), int64(2), object(11)
memory usage: 334.2+ MB


In [6]:
# EXTRACT is showing as object instead of int
results['EXTRACT'] = results['EXTRACT'].astype(int)

ValueError: invalid literal for int() with base 10: 'P90'

In [43]:
# Use REGEX to strip any letters in the column
import re

# for row, item in enumerate(results['DETERMIN']):
#     item = re.sub("[^0-9]", "", str(item)) 
#     results.loc[row,'DETERMIN'] = int(item)

results['EXTRACT'] = results['EXTRACT'].apply(lambda x: re.sub("[^0-9]", "", str(x)))
results['EXTRACT'] = results['EXTRACT'].apply(lambda x: int(x))
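
Before stripping characters it can help to see which values block the integer conversion. A minimal sketch on a toy Series (the values mirror the codes in the extract reference table, and the variable names are hypothetical), followed by a vectorised alternative to the two `apply` calls:

```python
import pandas as pd

# Toy column with the same kind of values as EXTRACT.
extract = pd.Series(["805", "997", "P90", "805"])

# Values that are not purely digits are the ones that break astype(int).
non_numeric = extract[~extract.str.isdigit()].unique()
print(non_numeric)

# Vectorised clean-up: drop every non-digit character, then convert.
cleaned = extract.str.replace(r"[^0-9]", "", regex=True).astype(int)
print(cleaned.tolist())
```

Here `P90` becomes `90`, so the key in the extract reference table may need the same treatment for later joins to match.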

In [44]:
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2737933 entries, 0 to 2737932
Data columns (total 16 columns):
 #   Column       Dtype  
---  ------       -----  
 0   SAMPLE_PK    int64  
 1   COMMOD       object 
 2   COMMTYPE     object 
 3   LAB          object 
 4   PESTCODE     object 
 5   TESTCLASS    object 
 6   CONCEN       float64
 7   LOD          float64
 8   CONUNIT      object 
 9   CONFMETHOD   object 
 10  CONFMETHOD2  float64
 11  ANNOTATE     object 
 12  QUANTITATE   object 
 13  MEAN         object 
 14  EXTRACT      int64  
 15  DETERMIN     int64  
dtypes: float64(3), int64(3), object(10)
memory usage: 334.2+ MB


In [46]:
results.to_csv('data/pdp_results_c.csv', index=False)

In [47]:
samples = pd.read_csv('data/pdp_samples.csv')
samples

Unnamed: 0,SAMPLE_PK,STATE,YEAR,MONTH,DAY,SITE,COMMOD,SOURCE_ID,VARIETY,ORIGIN,COUNTRY,DISTTYPE,COMMTYPE,CLAIM,QUANTITY,GROWST,PACKST,DISTST
0,1,CA,21,3,8,43,BB,,,2,595,T,FR,NC,,,,IL
1,2,CA,21,3,8,97,BB,,Na,2,275,T,FR,NC,,,,
2,3,CA,21,3,8,149,BB,,Na,1,,D,FR,PO,,CA,,CA
3,4,CA,21,3,8,230,BB,,,1,,T,FR,PO,,,,CA
4,5,CA,21,3,8,268,BB,,,2,275,T,FR,NC,,,,CA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,10123,WA,21,11,8,38,WS,,Butternut,1,,D,FR,NC,2.0,,,OR
10123,10124,WA,21,12,13,2,WS,,Acorn,1,,D,FR,NC,3.0,,,WA
10124,10125,WA,21,12,13,10,WS,,Butternut,1,,D,FR,NC,2.0,,,WA
10125,10126,WA,21,12,13,31,WS,,Butternut,2,595,D,FR,NC,2.0,,,CA


In [48]:
# Reload the cleaned file to check it (named so that it does not shadow the re module)
results_check = pd.read_csv('data/pdp_results_c.csv')

In [49]:
results_check

Unnamed: 0,SAMPLE_PK,COMMOD,COMMTYPE,LAB,PESTCODE,TESTCLASS,CONCEN,LOD,CONUNIT,CONFMETHOD,CONFMETHOD2,ANNOTATE,QUANTITATE,MEAN,EXTRACT,DETERMIN
0,1,BB,FR,WA1,001,A,,0.0030,M,,,,,ND,805,35
1,1,BB,FR,WA1,024,C,,0.0050,M,,,,,ND,805,64
2,1,BB,FR,WA1,028,A,,0.0100,M,,,,,ND,805,35
3,1,BB,FR,WA1,032,A,,0.0015,M,,,,,ND,805,64
4,1,BB,FR,WA1,034,A,,0.0100,M,,,,,ND,805,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2737928,10127,WS,FR,CA1,F56,A,,0.0013,M,,,,,ND,805,52
2737929,10127,WS,FR,CA1,F89,A,,0.0025,M,,,,,ND,805,52
2737930,10127,WS,FR,CA1,F94,A,,0.0200,M,,,,,ND,805,52
2737931,10127,WS,FR,CA1,G00,A,,0.0013,M,,,,,ND,805,52


## SQL-Pandas Analysis

The next step is to connect to the MySQL database, run the SQL queries from the notebook, and reproduce the same analysis with Pandas.


In [42]:
# Install pymysql
# !pip install pymysql

In [59]:
# Import the needed python libraries
import pymysql
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text
import os
from dotenv import load_dotenv

In [220]:
# Load the environment variables from the .env file
load_dotenv()

# Get the database password from the environment
password = os.environ.get('password')

connection_string = "mysql+pymysql://root:{password}@localhost/pdp_2021_db".format(password=password)

# Create a connection to the MySQL database using SQLAlchemy
engine = create_engine(connection_string, echo=False)

# Create the sql magic connection (the sql extension must already be loaded)
%sql $connection_string
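
If the password contains characters such as `@` or `/`, a URL built by plain string formatting may be parsed incorrectly. A minimal sketch of one way around this, assuming the same user, host and database as above; `build_connection_string` is a hypothetical helper, and `quote_plus` comes from the standard library:

```python
import os
from urllib.parse import quote_plus


def build_connection_string(password, user="root", host="localhost",
                            database="pdp_2021_db"):
    """Build a MySQL URL with the password percent-encoded."""
    return f"mysql+pymysql://{user}:{quote_plus(password)}@{host}/{database}"


# Fall back to an empty string if the variable is not set (assumption).
password = os.environ.get("password", "")
connection_string = build_connection_string(password)

# Illustrative password only, to show the encoding.
print(build_connection_string("p@ss/word"))
```

The resulting string can be passed to `create_engine` in the same way as the formatted one.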

In [62]:
# !pip install ipython-sql

In [221]:
# Load the sql magic extension
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [64]:
# Connect to the database and read the samples table
with engine.connect() as con:
    sql = 'select * from pdp_samples'
    result = pd.read_sql(sql, con)
result.head()

Unnamed: 0,SAMPLE_PK,STATE,YEAR,MONTH,DAY,SITE,COMMOD,SOURCE_ID,VARIETY,ORIGIN,COUNTRY,DISTTYPE,COMMTYPE,CLAIM,QUANTITY,GROWST,PACKST,DISTST
0,1,CA,21,3,8,43,BB,,,2,595.0,T,FR,NC,,,,IL
1,2,CA,21,3,8,97,BB,,Na,2,275.0,T,FR,NC,,,,
2,3,CA,21,3,8,149,BB,,Na,1,,D,FR,PO,,CA,,CA
3,4,CA,21,3,8,230,BB,,,1,,T,FR,PO,,,,CA
4,5,CA,21,3,8,268,BB,,,2,275.0,T,FR,NC,,,,CA


In [68]:
# Define a helper function that will insert queries into the queries table
def store_query(query_title, sql_query):
    query_dict = {
        'query_title': [query_title],
        'sql_query': [sql_query]
    }

    # Put the query into a one-row DataFrame
    queries = pd.DataFrame(query_dict)
    with engine.connect() as con:
        # Append the query to the queries table in MySQL
        queries.to_sql('queries', con, if_exists='append', index=False)
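
The round trip this helper enables (store a query, read it back, run it again) can be sketched as follows. This is an illustration rather than the project's code: SQLite in memory replaces MySQL, and `store_query_demo` and the toy table are hypothetical.

```python
import sqlite3

# SQLite in memory stands in for the MySQL database (assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE queries (query_title TEXT, sql_query TEXT)")
con.execute("CREATE TABLE origin_code (code INTEGER, origin TEXT)")
con.executemany(
    "INSERT INTO origin_code VALUES (?, ?)",
    [(1, "Domestic (U.S.)"), (2, "Imported"), (3, "Unknown origin")],
)


def store_query_demo(query_title, sql_query):
    """Append one titled query to the queries table."""
    con.execute(
        "INSERT INTO queries (query_title, sql_query) VALUES (?, ?)",
        (query_title, sql_query),
    )
    con.commit()


store_query_demo("count origins", "SELECT COUNT(*) FROM origin_code")

# Read the stored text back and execute it again.
title, stored_sql = con.execute(
    "SELECT query_title, sql_query FROM queries"
).fetchone()
row_count = con.execute(stored_sql).fetchone()[0]
print(title, row_count)
```

Keeping the query text in a table this way means the analysis can be replayed later without the notebook.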

**Explore all the tables in the database**

In [90]:
%%sql 
# 1.annotate code
SELECT * 
FROM `annotate code`;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
8 rows affected.


Annotate Code,Annotated Information
Q,Residue at below quantifiable level (BQL)
QV,Residue at <BQL> with presumptive violation - No Tolerance
V,Residue with a presumptive violation - No Tolerance
X,Residue with a presumptive violation - Exceeds Tolerance
Q,Residue at below quantifiable level (BQL)
QV,Residue at <BQL> with presumptive violation - No Tolerance
V,Residue with a presumptive violation - No Tolerance
X,Residue with a presumptive violation - Exceeds Tolerance


In [225]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("annotate code", con)
result.head()

Unnamed: 0,Annotate Code,Annotated Information
0,Q,Residue at below quantifiable level (BQL)
1,QV,Residue at <BQL> with presumptive violation - ...
2,V,Residue with a presumptive violation - No Tole...
3,X,Residue with a presumptive violation - Exceeds...
4,Q,Residue at below quantifiable level (BQL)


In [70]:
# Save the query in database
sql = """
SELECT 
    *
FROM
    `annotate code`;
    """
store_query("explore table-1.annotate code", sql)

In [91]:
%%sql 
# 2. commodity_type_code table
SELECT 
    * 
FROM 
    commodity_type_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
4 rows affected.


Commod Type Code,Commodity Type
FR,Fresh
FZ,Frozen
GR,"Grain, Raw"
RE,Liquid Ready-to-Serve


In [226]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("commodity_type_code", con)
result.head()

Unnamed: 0,Commod Type Code,Commodity Type
0,FR,Fresh
1,FZ,Frozen
2,GR,"Grain, Raw"
3,RE,Liquid Ready-to-Serve


In [82]:
# Save the query in database
sql = """ 
SELECT * FROM commodity_type_code;
"""
store_query("explore table-2.commodity_type_code", sql)

In [92]:
%%sql
# 3. commodity_code table
SELECT 
    * 
FROM
    commodity_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
21 rows affected.


Commodity Code,Commodity Name,# of Samples Analyzed
BB,"Blueberries, Cultivated",692
BR,Broccoli,708
BU,Butter,177
BZ,"Blueberries, Frozen",14
CE,Celery,354
CF,Cauliflower,531
CN,Cantaloupe,328
CO,Corn Grain,418
CR,Carrots,708
EP,Eggplant,703


In [227]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("commodity_code", con)
result.head()

Unnamed: 0,Commodity Code,Commodity Name,# of Samples Analyzed
0,BB,"Blueberries, Cultivated",692
1,BR,Broccoli,708
2,BU,Butter,177
3,BZ,"Blueberries, Frozen",14
4,CE,Celery,354


In [93]:
# Save the query in database
sql = """ 
SELECT 
    * 
FROM
    commodity_code;
"""
store_query("explore table-3.commodity_code", sql)

In [94]:
%%sql
# 4. concentration-lod_unit_code table
SELECT 
    *
FROM
    `concentration-lod_unit_code`;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
3 rows affected.


Concen/LOD Unit Code,Concen/LOD Unit Description
B,Parts-per-Billion (ppb)
M,Parts-per-Million (ppm)
T,Parts-per-Trillion (ppt)


In [228]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("concentration-lod_unit_code", con)
result.head()

Unnamed: 0,Concen/LOD Unit Code,Concen/LOD Unit Description
0,B,Parts-per-Billion (ppb)
1,M,Parts-per-Million (ppm)
2,T,Parts-per-Trillion (ppt)


In [95]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    `concentration-lod_unit_code`;
"""
store_query("explore table-4.concentration-lod_unit_code", sql)

In [96]:
%%sql
# 5. confirmation_method_code table
SELECT 
    *
FROM
    confirmation_method_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
4 rows affected.


ConfMethod Code,Confirmation Method
GN,GC/MSD with Negative Chemical Ionization
GT,GC/MS/MS - triple quadropole
HR,GC or LC High Resolution MS
LU,LC-MS/MS - triple quadrapole


In [229]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("confirmation_method_code", con)
result.head()

Unnamed: 0,ConfMethod Code,Confirmation Method
0,GN,GC/MSD with Negative Chemical Ionization
1,GT,GC/MS/MS - triple quadropole
2,HR,GC or LC High Resolution MS
3,LU,LC-MS/MS - triple quadrapole


In [97]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    confirmation_method_code;
"""
store_query("explore table-5. confirmation_method_code", sql)

In [98]:
%%sql
# 6. country_code table
SELECT 
    *
FROM
    country_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
23 rows affected.


Country Code,Country Name
150,Argentina
160,Australia
260,Canada
275,Chile
280,China
295,Costa Rica
350,France
394,Germany
400,Greece
415,Guatemala


In [230]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("country_code", con)
result.head()

Unnamed: 0,Country Code,Country Name
0,150,Argentina
1,160,Australia
2,260,Canada
3,275,Chile
4,280,China


In [99]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    country_code;
"""
store_query("explore table-6.country_code", sql)

In [100]:
%%sql
# 7. determinative_method_code table
SELECT 
    *
FROM
    determinative_method_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
5 rows affected.


Determin Code,Determinative Method
35,GC/MS/MS - triple quadrupole
52,LC/MS/MS - Liquid Chrom w/ Tandem Mass Spec - triple quad
64,Second LC/MS/MS
72,GC/MSD w/Negative Chemical Ionization (NCI)
80,LC/HRMS - Liquid Chrom w/High Resolution Mass Spec


In [231]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("determinative_method_code", con)
result.head()

Unnamed: 0,Determin Code,Determinative Method
0,35,GC/MS/MS - triple quadrupole
1,52,LC/MS/MS - Liquid Chrom w/ Tandem Mass Spec - ...
2,64,Second LC/MS/MS
3,72,GC/MSD w/Negative Chemical Ionization (NCI)
4,80,LC/HRMS - Liquid Chrom w/High Resolution Mass ...


In [101]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    determinative_method_code;
"""
store_query("explore table-7.determinative_method_code", sql)

In [102]:
%%sql
# 8. distribution_type_code table
SELECT 
    *
FROM
    distribution_type_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
7 rows affected.


DistType Code,Collection Facility Type
D,Distribution Center
F,Farmgate / Produce Stand
G,Grain Lot
H,Wholesale
L,Wholesale and Retail
R,Retail
T,Terminal Market


In [232]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("distribution_type_code", con)
result.head()

Unnamed: 0,DistType Code,Collection Facility Type
0,D,Distribution Center
1,F,Farmgate / Produce Stand
2,G,Grain Lot
3,H,Wholesale
4,L,Wholesale and Retail


In [103]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    distribution_type_code;
"""
store_query("explore table-8.distribution_type_code", sql)

In [104]:
%%sql
# 9. extract_code table
SELECT 
    *
FROM
    extract_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
3 rows affected.


Extract Code,Extraction Method
805,MDA Modified QuEChERS Method
997,OTHER Methods Used for Determinations of Single Components
P90,Sample held by lab more than 90 days before analysis started


In [233]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("extract_code", con)
result.head()

Unnamed: 0,Extract Code,Extraction Method
0,805,MDA Modified QuEChERS Method
1,997,OTHER Methods Used for Determinations of Singl...
2,P90,Sample held by lab more than 90 days before an...


In [105]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    extract_code;
"""
store_query("explore table-9.extract_code table", sql)

In [106]:
%%sql
# 10. lab_code table
SELECT 
    *
FROM
    lab_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
8 rows affected.


Lab Code,Lab Agency Name,Lab City/State
CA1,California Department of Food & Agriculture,"Sacramento, CA"
FL1,Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL"
MI1,Michigan Dept of Agriculture & Rural Development,"East Lansing, MI"
NY1,New York Department of Agriculture and Markets,"Albany, NY"
OH1,Ohio Department of Agriculture,"Reynoldsburg, OH"
TX1,Texas Department of Agriculture,"College Station, TX"
US2,"USDA, AMS, National Science Laboratory","Gastonia, NC"
WA1,Washington State Department of Agriculture,"Yakima, WA"


In [234]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("lab_code", con)
result.head()

Unnamed: 0,Lab Code,Lab Agency Name,Lab City/State
0,CA1,California Department of Food & Agriculture,"Sacramento, CA"
1,FL1,Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL"
2,MI1,Michigan Dept of Agriculture & Rural Development,"East Lansing, MI"
3,NY1,New York Department of Agriculture and Markets,"Albany, NY"
4,OH1,Ohio Department of Agriculture,"Reynoldsburg, OH"


In [107]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    lab_code;
"""
store_query("explore table-10. lab_code", sql)

In [108]:
%%sql
# 11. marketing_claim_code table
SELECT 
    *
FROM
    marketing_claim_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
2 rows affected.


Claim Code,Commodity Marketing Claim
NC,No Claim
PO,Organic


In [235]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("marketing_claim_code", con)
result.head()

Unnamed: 0,Claim Code,Commodity Marketing Claim
0,NC,No Claim
1,PO,Organic


In [109]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    marketing_claim_code;
"""
store_query("explore table-11.marketing_claim_code", sql)

In [110]:
%%sql
# 12. mean_code table
SELECT 
    *
FROM
    mean_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
4 rows affected.


Mean Code,Mean Result Finding
ND,"Non-Detect: Validated, well-recovered"
NP,Non-Detect: Marginal Performing Analyte
O,Detect: Original Extraction Value
R,Detect - Re-extraction Analysis Value


In [236]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("mean_code", con)
result.head()

Unnamed: 0,Mean Code,Mean Result Finding
0,ND,"Non-Detect: Validated, well-recovered"
1,NP,Non-Detect: Marginal Performing Analyte
2,O,Detect: Original Extraction Value
3,R,Detect - Re-extraction Analysis Value


In [111]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    mean_code;
"""
store_query("explore table-12.mean_code", sql)

In [112]:
%%sql
# 13. origin_code table
SELECT 
    *
FROM
    origin_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
3 rows affected.


Origin Code,Origin of Sample
1,Domestic (U.S.)
2,Imported
3,Unknown origin


In [237]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("origin_code", con)
result.head()

Unnamed: 0,Origin Code,Origin of Sample
0,1,Domestic (U.S.)
1,2,Imported
2,3,Unknown origin


In [113]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    origin_code;
"""
store_query("explore table-13.origin_code table", sql)

In [112]:
%%sql
# 14. quantitation_code table
SELECT 
    *
FROM
    quantitation_code;


In [238]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("quantitation_code", con)
result.head()

Unnamed: 0,Quantitate Code,Quantitation Method
0,E,Estimate
1,P,Marginal Performing Analyte


In [114]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    quantitation_code;
"""
store_query("explore table-14.quantitation_code table", sql)

In [115]:
%%sql
# 15.state_code
SELECT 
    *
FROM
    state_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
35 rows affected.


State Code,State
AL,Alabama
AR,Arkansas
AZ,Arizona
CA,California
CO,Colorado
CT,Connecticut
DE,Delaware
FL,Florida
GA,Georgia
ID,Idaho


In [239]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("state_code", con)
result.head()

Unnamed: 0,State Code,State
0,AL,Alabama
1,AR,Arkansas
2,AZ,Arizona
3,CA,California
4,CO,Colorado


In [116]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    state_code;
"""
store_query("explore table-15.state_code", sql)

In [117]:
%%sql
# 16.test_class_code
SELECT 
    *
FROM
    test_class_code;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
20 rows affected.


Test Class Code,Test (Compound) Class
A,Halogenated
B,Benzimidazole
C,Organophosphorus
D,Avermectin
E,Carbamate
F,Organonitrogen
G,"2,4-D / Acid Herbicides"
I,Other Compounds
J,Imidazolinone
K,Sulfonyl Urea Herbicides


In [240]:
# Pandas equivalent of query
with engine.connect() as con:
    result = pd.read_sql("test_class_code", con)
result.head()

Unnamed: 0,Test Class Code,Test (Compound) Class
0,A,Halogenated
1,B,Benzimidazole
2,C,Organophosphorus
3,D,Avermectin
4,E,Carbamate


In [118]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    test_class_code;
"""
store_query("explore table-16.test_class_code", sql)

**Perform Analysis Queries**

In [119]:
%%sql
# Fetch the number of samples tested
SELECT 
    COUNT(*) AS number_of_samples
FROM
    pdp_samples;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
1 rows affected.


number_of_samples
10127


In [241]:
# Pandas equivalent of query
with engine.connect() as con:
    sample_df= pd.read_sql("pdp_samples", con)

print(len(sample_df))

10127


In [120]:
# Save the query in database
sql = """ 
SELECT 
    COUNT(*) AS number_of_samples
FROM
    pdp_samples;
"""
store_query("Fetch the number of samples tested", sql)

In [121]:
%%sql
# Fetch the number of distinct samples for which results were collected
SELECT 
    COUNT(DISTINCT SAMPLE_PK) AS Samples_with_result
FROM
    pdp_results;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
1 rows affected.


Samples_with_result
10127


In [268]:
# Pandas equivalent of query: count the distinct SAMPLE_PK values in the results table
with engine.connect() as con:
    results_df = pd.read_sql("pdp_results", con)

print(results_df['SAMPLE_PK'].nunique())

10127


In [122]:
# Save the query in database
sql = """ 
SELECT 
    COUNT(DISTINCT SAMPLE_PK) AS Samples_with_result
FROM
    pdp_results;
"""
store_query("Fetch the number of distinct samples", sql)

In [123]:
%%sql
# Fetch the number of results
SELECT 
    COUNT(*) AS number_of_results
FROM
    pdp_results;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
1 rows affected.


number_of_results
2737933


In [269]:
# Pandas equivalent of query
with engine.connect() as con:
    results_df= pd.read_sql("pdp_results", con)

print(len(results_df))

2737933


In [124]:
# Save the query in database
sql = """ 
SELECT 
    COUNT(*) AS number_of_results
FROM
    pdp_results;
"""
store_query("Fetch the number of results", sql)

In [261]:
%%sql
# Fetch the number of tests/results per sample, largest first
SELECT 
    s.SAMPLE_PK AS Sample,
    `Commodity Name` AS Commodity_Name,
    COUNT(r.SAMPLE_PK) AS results
FROM pdp_samples s
JOIN pdp_results r ON s.SAMPLE_PK = r.SAMPLE_PK
JOIN commodity_code c ON s.COMMOD = c.`Commodity Code`
GROUP BY 1, 2
ORDER BY 3 DESC;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
10127 rows affected.


Sample,Commodity_Name,results
8031,Summer Squash,514
8032,Summer Squash,514
8033,Summer Squash,514
8034,Summer Squash,514
8035,Summer Squash,514
8036,Summer Squash,514
8037,Summer Squash,514
8038,Summer Squash,514
8039,Summer Squash,514
8040,Summer Squash,514


In [270]:
# Pandas equivalent of query
with engine.connect() as con:
    commod_df= pd.read_sql("commodity_code", con)

merge_df= pd.merge(results_df, commod_df, right_on='Commodity Code', left_on='COMMOD', how="left")
merge_df.groupby('SAMPLE_PK').count()['Commodity Name'].sort_values(ascending=False)

SAMPLE_PK
8424    514
9752    514
9649    514
9650    514
9665    514
       ... 
7611    115
7612    115
7613    115
7614    115
7608    115
Name: Commodity Name, Length: 10127, dtype: int64
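
To mirror the SQL grouping on both the sample and the commodity name, the merge can start from the samples table. A minimal sketch on toy frames (the rows are illustrative; the column names follow the tables above):

```python
import pandas as pd

# Toy stand-ins for pdp_samples, pdp_results and commodity_code.
samples = pd.DataFrame({"SAMPLE_PK": [1, 2], "COMMOD": ["BB", "BR"]})
results = pd.DataFrame({"SAMPLE_PK": [1, 1, 1, 2, 2]})
commodities = pd.DataFrame({
    "Commodity Code": ["BB", "BR"],
    "Commodity Name": ["Blueberries, Cultivated", "Broccoli"],
})

per_sample = (
    samples.merge(results, on="SAMPLE_PK")  # inner join, like JOIN in SQL
    .merge(commodities, left_on="COMMOD", right_on="Commodity Code")
    .groupby(["SAMPLE_PK", "Commodity Name"])
    .size()                                 # number of result rows per group
    .reset_index(name="results")
    .sort_values("results", ascending=False)
)
print(per_sample)
```

Grouping on both keys returns the commodity name next to each sample, as the SQL result does.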

In [126]:
# Save the query in database
sql = """ 
SELECT 
    s.SAMPLE_PK AS Sample,
    `Commodity Name` AS Commodity_Name,
    COUNT(r.SAMPLE_PK) AS results
FROM pdp_samples s
JOIN pdp_results r ON s.SAMPLE_PK = r.SAMPLE_PK
JOIN commodity_code c ON s.COMMOD = c.`Commodity Code`
GROUP BY 1, 2;
"""
store_query("Fetch the number of tests/results per sample", sql)

In [127]:
%%sql
# Fetch the top 5 rows of all columns in the samples table
SELECT 
    *
FROM
    pdp_samples
LIMIT 5;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
5 rows affected.


SAMPLE_PK,STATE,YEAR,MONTH,DAY,SITE,COMMOD,SOURCE_ID,VARIETY,ORIGIN,COUNTRY,DISTTYPE,COMMTYPE,CLAIM,QUANTITY,GROWST,PACKST,DISTST
1,CA,21,3,8,43,BB,,,2,595.0,T,FR,NC,,,,IL
2,CA,21,3,8,97,BB,,Na,2,275.0,T,FR,NC,,,,
3,CA,21,3,8,149,BB,,Na,1,,D,FR,PO,,CA,,CA
4,CA,21,3,8,230,BB,,,1,,T,FR,PO,,,,CA
5,CA,21,3,8,268,BB,,,2,275.0,T,FR,NC,,,,CA


In [263]:
# Pandas equivalent of query
sample_df.head()

Unnamed: 0,SAMPLE_PK,STATE,YEAR,MONTH,DAY,SITE,COMMOD,SOURCE_ID,VARIETY,ORIGIN,COUNTRY,DISTTYPE,COMMTYPE,CLAIM,QUANTITY,GROWST,PACKST,DISTST
0,1,CA,21,3,8,43,BB,,,2,595.0,T,FR,NC,,,,IL
1,2,CA,21,3,8,97,BB,,Na,2,275.0,T,FR,NC,,,,
2,3,CA,21,3,8,149,BB,,Na,1,,D,FR,PO,,CA,,CA
3,4,CA,21,3,8,230,BB,,,1,,T,FR,PO,,,,CA
4,5,CA,21,3,8,268,BB,,,2,275.0,T,FR,NC,,,,CA


In [128]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    pdp_samples
LIMIT 5;
"""
store_query("Fetch the top 5 rows of all columns in the samples table", sql)

In [129]:
%%sql
# Fetch the top 5 rows of all columns in the results table
SELECT 
    *
FROM
    pdp_results
LIMIT 5;

 * mysql+pymysql://root:***@localhost/pdp_2021_db
5 rows affected.


SAMPLE_PK,COMMOD,COMMTYPE,LAB,PESTCODE,TESTCLASS,CONCEN,LOD,CONUNIT,CONFMETHOD,CONFMETHOD2,ANNOTATE,QUANTITATE,MEAN,EXTRACT,DETERMIN
1,BB,FR,WA1,1,A,,0.003,M,,,,,ND,805,35
1,BB,FR,WA1,24,C,,0.005,M,,,,,ND,805,64
1,BB,FR,WA1,28,A,,0.01,M,,,,,ND,805,35
1,BB,FR,WA1,32,A,,0.0015,M,,,,,ND,805,64
1,BB,FR,WA1,34,A,,0.01,M,,,,,ND,805,35


In [271]:
# Pandas equivalent of query
results_df.head()

Unnamed: 0,SAMPLE_PK,COMMOD,COMMTYPE,LAB,PESTCODE,TESTCLASS,CONCEN,LOD,CONUNIT,CONFMETHOD,CONFMETHOD2,ANNOTATE,QUANTITATE,MEAN,EXTRACT,DETERMIN
0,1,BB,FR,WA1,1,A,,0.003,M,,,,,ND,805,35
1,1,BB,FR,WA1,24,C,,0.005,M,,,,,ND,805,64
2,1,BB,FR,WA1,28,A,,0.01,M,,,,,ND,805,35
3,1,BB,FR,WA1,32,A,,0.0015,M,,,,,ND,805,64
4,1,BB,FR,WA1,34,A,,0.01,M,,,,,ND,805,35


In [130]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    pdp_results
LIMIT 5;
"""
store_query("Fetch the top 5 rows of all columns in the results table", sql)

In [131]:
%%sql
# Fetch the distinct countries from which samples were collected and the number of samples from each
SELECT DISTINCT
    `Country Name` AS Country_Name,
    COUNTRY AS Country_Code,
    COUNT(SAMPLE_PK) AS Number_of_Samples
FROM
    pdp_samples
        JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
GROUP BY 1 , 2
ORDER BY 3 DESC;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
23 rows affected.


Country_Name,Country_Code,Number_of_Samples
Mexico,595,1742
Chile,275,329
Peru,720,236
Canada,260,227
Guatemala,415,206
Argentina,150,147
Honduras,430,94
Uruguay,930,28
Israel,475,22
Morocco,610,22


In [291]:
# Pandas equivalent of query
with engine.connect() as con:
    country_df = pd.read_sql('country_code', con)
sample_full_df = pd.merge(sample_df, country_df, left_on='COUNTRY', right_on='Country Code', how='left')
sample_full_df.head()

ValueError: You are trying to merge on object and int64 columns for key 'COUNTRY'. If you wish to proceed you should use pd.concat
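
The `ValueError` above comes from a dtype mismatch: `COUNTRY` is an object column (it holds blanks and codes such as `M99`), while `Country Code` is read back as `int64`. A minimal sketch of one way around it, using small made-up frames in place of `sample_df` and `country_df`: cast both keys to strings, strip a trailing `.0` in case the codes were stored as floats (an assumption worth checking on the real data), then merge and count. The `toy_*` names and rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for sample_df and country_df (toy rows, not real PDP data)
toy_samples = pd.DataFrame({
    "SAMPLE_PK": [1, 2, 3, 4, 5],
    "COUNTRY": ["595.0", "275.0", "", "595.0", "M99"],  # object dtype, blanks included
})
toy_countries = pd.DataFrame({
    "Country Code": [595, 275],                          # int64 dtype
    "Country Name": ["Mexico", "Chile"],
})

# Normalise both merge keys to plain strings so their dtypes match
toy_samples["COUNTRY"] = (
    toy_samples["COUNTRY"].astype(str).str.replace(r"\.0$", "", regex=True)
)
toy_countries["Country Code"] = toy_countries["Country Code"].astype(str)

# Inner merge mirrors the SQL JOIN: samples without a matching country are dropped
merged = toy_samples.merge(
    toy_countries, left_on="COUNTRY", right_on="Country Code", how="inner"
)

# Equivalent of GROUP BY country ... ORDER BY count DESC
country_counts = (
    merged.groupby(["Country Name", "COUNTRY"])["SAMPLE_PK"]
    .count()
    .reset_index(name="Number_of_Samples")
    .sort_values("Number_of_Samples", ascending=False)
    .reset_index(drop=True)
)
print(country_counts)
```

Casting to strings rather than integers keeps non-numeric codes such as `M99` intact.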

In [290]:
sample_df[sample_df['COUNTRY'] == ""]

Unnamed: 0,SAMPLE_PK,STATE,YEAR,MONTH,DAY,SITE,COMMOD,SOURCE_ID,VARIETY,ORIGIN,COUNTRY,DISTTYPE,COMMTYPE,CLAIM,QUANTITY,GROWST,PACKST,DISTST
2,3,CA,21,3,8,149,BB,,Na,1,,D,FR,PO,,CA,,CA
3,4,CA,21,3,8,230,BB,,,1,,T,FR,PO,,,,CA
21,22,CA,21,4,12,642,BB,,Na,1,,R,FR,PO,,,,CA
25,26,CA,21,5,10,43,BB,,Unknown,1,,T,FR,NC,,,,CA
26,27,CA,21,5,10,49,BB,,Unknown,1,,T,FR,NC,,,,CA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10119,10120,WA,21,11,8,8,WS,,Acorn,1,,D,FR,PO,3.0,,,WA
10121,10122,WA,21,11,8,37,WS,P,Acorn,1,,R,FR,NC,4.0,OR,,
10122,10123,WA,21,11,8,38,WS,,Butternut,1,,D,FR,NC,2.0,,,OR
10123,10124,WA,21,12,13,2,WS,,Acorn,1,,D,FR,NC,3.0,,,WA


In [292]:
# Replace blank COUNTRY values with 0 and leave every other value unchanged
sample_df['COUNTRY'].apply(lambda x: 0 if x == '' else x)

In [132]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `Country Name` AS Country_Name,
    COUNTRY AS Country_Code,
    COUNT(SAMPLE_PK) AS Number_of_Samples
FROM
    pdp_samples
        JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
GROUP BY 1 , 2
ORDER BY 3 DESC;

"""
store_query("Fetch the distinct countries and number of samples collected", sql)

In [133]:
%%sql
# Fetch the commodities sampled from each country
SELECT DISTINCT
    `Country Name` AS Country_Name,
    COUNTRY AS Country_Code,
    `Commodity Name` AS Commodity
FROM
    pdp_samples
        LEFT JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
        JOIN
    commodity_code ON pdp_samples.COMMOD = commodity_code.`Commodity Code`
ORDER BY 1;



 * mysql+pymysql://root:***@localhost/pdp_2021_db
102 rows affected.


Country_Name,Country_Code,Commodity
,,Winter Squash
,M99,Grape Juice
,,"Blueberries, Cultivated"
,,Eggplant
,M01,Grape Juice
,M48,Grape Juice
,,Sweet Bell Peppers
,,Broccoli
,,Watermelon
,M41,Grape Juice


In [134]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `Country Name` AS Country_Name,
    COUNTRY AS Country_Code,
    `Commodity Name` AS Commodity
FROM
    pdp_samples
        LEFT JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
        JOIN
    commodity_code ON pdp_samples.COMMOD = commodity_code.`Commodity Code`
ORDER BY 1;


"""
store_query("Fetch the commodities sampled from each country", sql)

In [135]:
%%sql
# Fetch the total number of distinct countries from which samples were collected
SELECT 
    COUNT(DISTINCT `Country Name`) AS Number_of_Countries
FROM
    pdp_samples
        JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
        JOIN
    pdp_results ON pdp_samples.SAMPLE_PK = pdp_results.SAMPLE_PK;




 * mysql+pymysql://root:***@localhost/pdp_2021_db
1 rows affected.


Number_of_Countries
23


In [136]:
# Save the query in database
sql = """ 
SELECT 
    COUNT(DISTINCT `Country Name`) AS Number_of_Countries
FROM
    pdp_samples
        JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
        JOIN
    pdp_results ON pdp_samples.SAMPLE_PK = pdp_results.SAMPLE_PK;

"""
store_query("Fetch the total number of distinct countries from which samples were collected", sql)

In [137]:
%%sql
# Fetch the commodity results collected from each country
SELECT DISTINCT
    `Country Name` AS Country_Name,
    COUNTRY AS Country_Code,
    `Commodity Name` AS Commodity
FROM
    pdp_samples
        LEFT JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
        JOIN
    pdp_results ON pdp_samples.SAMPLE_PK = pdp_results.SAMPLE_PK
        JOIN
    commodity_code ON pdp_results.COMMOD = commodity_code.`Commodity Code`
ORDER BY Country_Code;





 * mysql+pymysql://root:***@localhost/pdp_2021_db
102 rows affected.


Country_Name,Country_Code,Commodity
,,Carrots
,,Plums
,,"Blueberries, Cultivated"
,,Broccoli
,,"Blueberries, Frozen"
,,Butter
,,Celery
,,Cauliflower
,,Corn Grain
,,Summer Squash


In [138]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `Country Name` AS Country_Name,
    COUNTRY AS Country_Code,
    `Commodity Name` AS Commodity
FROM
    pdp_samples
        LEFT JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
        JOIN
    pdp_results ON pdp_samples.SAMPLE_PK = pdp_results.SAMPLE_PK
        JOIN
    commodity_code ON pdp_results.COMMOD = commodity_code.`Commodity Code`
ORDER BY Country_Code;

"""
store_query("Fetch the commodity results collected from each country", sql)

In [139]:
%%sql
# Count of results from each country from highest to lowest
SELECT 
    `Country Name` AS Country_Name,
    COUNT(COUNTRY) AS Number_of_Results
FROM
    pdp_samples
        JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
        JOIN
    pdp_results ON pdp_samples.SAMPLE_PK = pdp_results.SAMPLE_PK
GROUP BY `Country Name`
ORDER BY Number_of_Results DESC;





 * mysql+pymysql://root:***@localhost/pdp_2021_db
23 rows affected.


Country_Name,Number_of_Results
Mexico,608072
Chile,73413
Canada,67138
Guatemala,53268
Peru,51282
Honduras,26779
Argentina,24274
Uruguay,6877
Israel,6473
Morocco,5500


In [140]:
# Save the query in database
sql = """ 
SELECT 
    `Country Name` AS Country_Name,
    COUNT(COUNTRY) AS Number_of_Results
FROM
    pdp_samples
        JOIN
    country_code ON pdp_samples.COUNTRY = country_code.`Country Code`
        JOIN
    pdp_results ON pdp_samples.SAMPLE_PK = pdp_results.SAMPLE_PK
GROUP BY `Country Name`
ORDER BY Number_of_Results DESC;

"""
store_query("Count of results from each country from highest to lowest", sql)

In [141]:
%%sql
# Each Commodity with number of test results
SELECT DISTINCT
    `COMMODITY NAME`, 
    COUNT(*) AS Samples
FROM
    pdp_results
        LEFT JOIN
    commodity_code ON commodity_code.`Commodity Code` = pdp_results.COMMOD
GROUP BY `COMMODITY NAME`
ORDER BY Samples DESC;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
21 rows affected.


COMMODITY NAME,Samples
Winter Squash,361593
Green Beans,357543
Summer Squash,264952
Carrots,229821
Eggplant,185710
Broccoli,144052
"Blueberries, Cultivated",140227
Tangerines,132416
Grape Juice,123900
Peaches,122963


In [143]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `COMMODITY NAME`, 
    COUNT(*) AS Samples
FROM
    pdp_results
        LEFT JOIN
    commodity_code ON commodity_code.`Commodity Code` = pdp_results.COMMOD
GROUP BY `COMMODITY NAME`
ORDER BY Samples DESC;

"""
store_query("Each Commodity with number of test results", sql)

In [144]:
%%sql
# Commodity samples not in test results
SELECT DISTINCT
    `COMMODITY NAME`, COUNT(*) AS Samples
FROM
    pdp_samples p
        LEFT JOIN
    commodity_code c ON c.`Commodity Code` = p.COMMOD
        LEFT JOIN
    pdp_results r ON c.`Commodity Code` = r.COMMOD
WHERE
    c.`Commodity Code` NOT IN (r.COMMOD)
GROUP BY `COMMODITY NAME`
ORDER BY Samples DESC;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
0 rows affected.


COMMODITY NAME,Samples
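
The empty result above is guaranteed by the query's shape rather than by the data: matched rows always satisfy `c.Commodity Code = r.COMMOD`, and unmatched rows have a NULL `r.COMMOD`, so `NOT IN (r.COMMOD)` is never true. As a cross-check, here is a sketch of an anti-join in pandas using the `indicator` option of `merge`. The toy frames and the `XX` commodity code are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames standing in for sample_df and results_df
toy_samples = pd.DataFrame({
    "SAMPLE_PK": [1, 2, 3, 4],
    "COMMOD": ["BB", "BB", "WS", "XX"],   # 'XX' has no test results
})
toy_results = pd.DataFrame({
    "SAMPLE_PK": [1, 1, 3],
    "COMMOD": ["BB", "BB", "WS"],
})

# Distinct commodity codes that have at least one test result
tested = toy_results[["COMMOD"]].drop_duplicates()

# Left merge with an indicator column; 'left_only' marks samples with no result
flagged = toy_samples.merge(tested, on="COMMOD", how="left", indicator=True)
untested = flagged[flagged["_merge"] == "left_only"]

# Number of samples per commodity that never appears in the results
untested_counts = untested.groupby("COMMOD").size().to_dict()
print(untested_counts)
```

The SQL counterpart of this pattern would be a `LEFT JOIN` followed by `WHERE r.COMMOD IS NULL`.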


In [145]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `COMMODITY NAME`, COUNT(*) AS Samples
FROM
    pdp_samples p
        LEFT JOIN
    commodity_code c ON c.`Commodity Code` = p.COMMOD
        LEFT JOIN
    pdp_results r ON c.`Commodity Code` = r.COMMOD
WHERE
    c.`Commodity Code` NOT IN (r.COMMOD)
GROUP BY `COMMODITY NAME`
ORDER BY Samples DESC;

"""
store_query("Commodity samples not in test results", sql)

In [146]:
%%sql
# Fetch the Concentration/LOD units to confirm if uniform
SELECT DISTINCT
    `CONUNIT`
FROM
    pdp_results;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
1 rows affected.


CONUNIT
M


In [147]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `CONUNIT`
FROM
    pdp_results;

"""
store_query("Fetch the Concentration/LOD units to confirm if uniform", sql)

In [148]:
%%sql
# Fetch the top 100 samples with the highest Limit of Detection (LOD)
SELECT DISTINCT
    R.SAMPLE_PK, C.`Commodity Name`, R.LOD AS LOD
FROM
    pdp_results R
        LEFT JOIN
    commodity_code C ON R.COMMOD = C.`Commodity Code`
ORDER BY LOD DESC
LIMIT 100;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
100 rows affected.


SAMPLE_PK,Commodity Name,LOD
3185,Corn Grain,0.5
3217,Corn Grain,0.5
3154,Corn Grain,0.5
3140,Corn Grain,0.5
3126,Corn Grain,0.5
3085,Corn Grain,0.5
3112,Corn Grain,0.5
3153,Corn Grain,0.5
3167,Corn Grain,0.5
3139,Corn Grain,0.5


In [149]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    R.SAMPLE_PK, C.`Commodity Name`, R.LOD AS LOD
FROM
    pdp_results R
        LEFT JOIN
    commodity_code C ON R.COMMOD = C.`Commodity Code`
ORDER BY LOD DESC
LIMIT 100;

"""
store_query("Fetch the top 100 samples with the highest Limit of Detection (LOD)", sql)

In [150]:
%%sql
# Which pesticide has the highest Limit of Detection (LOD) per commodity
SELECT DISTINCT
    C.`Commodity Name`, P.`Pesticide Name`, T.LOD
FROM
    pdp_results T
        LEFT JOIN
    commodity_code C ON T.COMMOD = C.`Commodity Code`
        JOIN
    pest_code P ON T.PESTCODE = P.`Pest Code`
ORDER BY C.`Commodity Name` ASC , T.LOD DESC;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
3567 rows affected.


Commodity Name,Pesticide Name,LOD
"Blueberries, Cultivated",MGK-264,0.1
"Blueberries, Cultivated",Profenofos,0.075
"Blueberries, Cultivated",Fluvalinate (as Tau-Fluvalinate),0.05
"Blueberries, Cultivated",Propargite,0.05
"Blueberries, Cultivated",Pendimethalin,0.05
"Blueberries, Cultivated",Abamectin,0.05
"Blueberries, Cultivated",Oxyfluorfen,0.05
"Blueberries, Cultivated",Phenothrin,0.05
"Blueberries, Cultivated",Iprodione,0.04
"Blueberries, Cultivated",Methomyl,0.03


In [151]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    C.`Commodity Name`, P.`Pesticide Name`, T.LOD
FROM
    pdp_results T
        LEFT JOIN
    commodity_code C ON T.COMMOD = C.`Commodity Code`
        JOIN
    pest_code P ON T.PESTCODE = P.`Pest Code`
ORDER BY C.`Commodity Name` ASC , T.LOD DESC;

"""
store_query("Which pesticide has the highest Limit of Detection (LOD) per commodity", sql)
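
The query above lists every distinct commodity, pesticide and LOD combination sorted by LOD, so the answer for each commodity is the first row of its group. A sketch of how pandas could keep only that top row, assuming a frame that already holds the joined names; the `toy` frame and the `Pesticide A`/`Pesticide B` names are placeholders.

```python
import pandas as pd

# Hypothetical joined frame: one row per (commodity, pesticide, LOD)
toy = pd.DataFrame({
    "Commodity Name": ["Blueberries, Cultivated"] * 3 + ["Corn Grain"] * 2,
    "Pesticide Name": ["MGK-264", "Profenofos", "Methomyl",
                       "Pesticide A", "Pesticide B"],
    "LOD": [0.1, 0.075, 0.03, 0.5, 0.2],
})

# Index label of the row with the largest LOD inside each commodity group
top_idx = toy.groupby("Commodity Name")["LOD"].idxmax()

# Keep only those rows: one pesticide per commodity
top_lod = toy.loc[top_idx].reset_index(drop=True)
print(top_lod)
```

`idxmax` returns a single row per group, so ties on the maximum LOD keep only the first pesticide encountered.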

In [152]:
%%sql
# Fetch the average Limit of Detection (LOD) per pesticide per commodity
SELECT 
    C.`Commodity Name`,
    P.`Pesticide Name`,
    ROUND(AVG(T.LOD), 4) AS Average_LOD
FROM
    pdp_results T
        LEFT JOIN
    commodity_code C ON T.COMMOD = C.`Commodity Code`
        JOIN
    pest_code P ON T.PESTCODE = P.`Pest Code`
GROUP BY C.`Commodity Name` , P.`Pesticide Name`
ORDER BY C.`Commodity Name` , Average_LOD DESC;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
3269 rows affected.


Commodity Name,Pesticide Name,Average_LOD
"Blueberries, Cultivated",MGK-264,0.1
"Blueberries, Cultivated",Profenofos,0.075
"Blueberries, Cultivated",Abamectin,0.05
"Blueberries, Cultivated",Fluvalinate (as Tau-Fluvalinate),0.05
"Blueberries, Cultivated",Pendimethalin,0.05
"Blueberries, Cultivated",Oxyfluorfen,0.05
"Blueberries, Cultivated",Phenothrin,0.05
"Blueberries, Cultivated",Propargite,0.05
"Blueberries, Cultivated",Iprodione,0.04
"Blueberries, Cultivated",Methomyl,0.03


In [153]:
# Save the query in database
sql = """ 
SELECT 
    C.`Commodity Name`,
    P.`Pesticide Name`,
    ROUND(AVG(T.LOD), 4) AS Average_LOD
FROM
    pdp_results T
        LEFT JOIN
    commodity_code C ON T.COMMOD = C.`Commodity Code`
        JOIN
    pest_code P ON T.PESTCODE = P.`Pest Code`
GROUP BY C.`Commodity Name` , P.`Pesticide Name`
ORDER BY C.`Commodity Name` , Average_LOD DESC;

"""
store_query("Fetch the average Limit of Detection (LOD) per pesticide per commodity", sql)

In [154]:
%%sql
# Fetch the maximum Limit of Detection (LOD) per pesticide per commodity
SELECT 
    C.`Commodity Name`,
    P.`Pesticide Name`,
    ROUND(MAX(T.LOD), 4) AS Maximum_LOD
FROM
    pdp_results T
        LEFT JOIN
    commodity_code C ON T.COMMOD = C.`Commodity Code`
        JOIN
    pest_code P ON T.PESTCODE = P.`Pest Code`
GROUP BY C.`Commodity Name` , P.`Pesticide Name`
ORDER BY C.`Commodity Name` , Maximum_LOD DESC;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
3269 rows affected.


Commodity Name,Pesticide Name,Maximum_LOD
"Blueberries, Cultivated",MGK-264,0.1
"Blueberries, Cultivated",Profenofos,0.075
"Blueberries, Cultivated",Abamectin,0.05
"Blueberries, Cultivated",Phenothrin,0.05
"Blueberries, Cultivated",Fluvalinate (as Tau-Fluvalinate),0.05
"Blueberries, Cultivated",Pendimethalin,0.05
"Blueberries, Cultivated",Propargite,0.05
"Blueberries, Cultivated",Oxyfluorfen,0.05
"Blueberries, Cultivated",Iprodione,0.04
"Blueberries, Cultivated",Methomyl,0.03


In [155]:
# Save the query in database
sql = """ 
SELECT 
    C.`Commodity Name`,
    P.`Pesticide Name`,
    ROUND(MAX(T.LOD), 4) AS Maximum_LOD
FROM
    pdp_results T
        LEFT JOIN
    commodity_code C ON T.COMMOD = C.`Commodity Code`
        JOIN
    pest_code P ON T.PESTCODE = P.`Pest Code`
GROUP BY C.`Commodity Name` , P.`Pesticide Name`
ORDER BY C.`Commodity Name` , Maximum_LOD DESC;

"""
store_query("Fetch the maximum Limit of Detection (LOD) per pesticide per commodity", sql)

In [156]:
%%sql
# Fetch the number of tests that did not detect any residue
SELECT 
    count(SAMPLE_PK)
FROM
    pdp_results
WHERE
    MEAN = 'NP' OR MEAN = 'ND'
;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
1 rows affected.


count(SAMPLE_PK)
2711338


In [157]:
# Save the query in database
sql = """ 
SELECT 
    count(SAMPLE_PK)
FROM
    pdp_results
WHERE
    MEAN = 'NP' OR MEAN = 'ND'
;

"""
store_query("Fetch the number of tests that did not detect any residue", sql)

In [160]:
%%sql
# Fetch the number of tests that detected pesticide residue
SELECT 
   count(SAMPLE_PK) as Number_of_detections
FROM
    pdp_results
WHERE
    MEAN NOT IN ('ND' , 'NP')
;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
1 rows affected.


Number_of_detections
26595
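
A possible pandas control for the two counts above, sketched on a toy `MEAN` column. One detail to watch: in SQL, a row whose `MEAN` is NULL satisfies neither `MEAN = 'ND' OR MEAN = 'NP'` nor `MEAN NOT IN ('ND', 'NP')`, whereas `~isin(...)` in pandas would count a missing value as a detection, so the sketch filters on `notna()` as well. The code `X` is a placeholder for any detection code.

```python
import pandas as pd

# Hypothetical stand-in for results_df; 'X' is a placeholder detection code
toy_results = pd.DataFrame({
    "SAMPLE_PK": [1, 1, 2, 2, 3, 3],
    "MEAN": ["ND", "X", "NP", "ND", "X", None],
})

# True where the test reported no residue
no_residue = toy_results["MEAN"].isin(["ND", "NP"])

# Mirrors WHERE MEAN = 'NP' OR MEAN = 'ND'
n_not_detected = int(no_residue.sum())

# Mirrors WHERE MEAN NOT IN ('ND', 'NP'); missing values are excluded as in SQL
n_detected = int((~no_residue & toy_results["MEAN"].notna()).sum())

print(n_not_detected, n_detected)
```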


In [161]:
# Save the query in database
sql = """ 
SELECT 
   count(SAMPLE_PK) as Number_of_detections
FROM
    pdp_results
WHERE
    MEAN NOT IN ('ND' , 'NP')
;

"""
store_query("Fetch the number of tests that detected pesticide residue", sql)

In [162]:
%%sql
# Fetch number of tests that detected pesticide residue for each commodity
SELECT 
    C.`Commodity Name`, 
    COUNT(SAMPLE_PK) AS number_of_detection,
    SUM(COUNT(SAMPLE_PK)) OVER (ORDER BY C.`Commodity Name`) AS running_total_Detection
FROM
    pdp_results T
        JOIN
    commodity_code C ON T.COMMOD = C.`Commodity Code`
WHERE
    MEAN NOT IN ('ND' , 'NP')
GROUP BY 1 
ORDER BY 1, 2 
;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
21 rows affected.


Commodity Name,number_of_detection,running_total_Detection
"Blueberries, Cultivated",2439,2439
"Blueberries, Frozen",80,2519
Broccoli,1456,3975
Butter,459,4434
Cantaloupe,570,5004
Carrots,745,5749
Cauliflower,491,6240
Celery,871,7111
Corn Grain,163,7274
Eggplant,1454,8728
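
The window expression `SUM(COUNT(...)) OVER (ORDER BY ...)` is a running total over the grouped counts. In pandas this could be approximated with a `groupby` count followed by `cumsum`, as in the sketch below. The `toy` frame is made up, and `X` is a placeholder for any detection code.

```python
import pandas as pd

# Hypothetical joined results: commodity name plus the MEAN code of each test
toy = pd.DataFrame({
    "Commodity Name": ["Broccoli", "Butter", "Broccoli",
                       "Carrots", "Butter", "Broccoli"],
    "MEAN": ["X", "X", "ND", "X", "X", "X"],
})

# Keep only tests that detected a residue (WHERE MEAN NOT IN ('ND', 'NP'))
detected = toy[~toy["MEAN"].isin(["ND", "NP"])]

# GROUP BY commodity; groupby sorts the keys, matching ORDER BY 1
per_commodity = (
    detected.groupby("Commodity Name")
    .size()
    .reset_index(name="number_of_detection")
)

# Running total in commodity-name order
per_commodity["running_total_Detection"] = (
    per_commodity["number_of_detection"].cumsum()
)
print(per_commodity)
```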


In [163]:
# Save the query in database
sql = """ 
SELECT 
    C.`Commodity Name`, 
    COUNT(SAMPLE_PK) AS number_of_detection,
    SUM(COUNT(SAMPLE_PK)) OVER (ORDER BY C.`Commodity Name`) AS running_total_Detection
FROM
    pdp_results T
        JOIN
    commodity_code C ON T.COMMOD = C.`Commodity Code`
WHERE
    MEAN NOT IN ('ND' , 'NP')
GROUP BY 1 
ORDER BY 1, 2 
;

"""
store_query("Fetch number of tests that detected pesticide residue for each commodity", sql)

In [164]:
%%sql
# Calculate percentage of the results that detected pesticide residue per commodity
WITH detect AS (
    SELECT 
        C.`Commodity Name`,
        COUNT(SAMPLE_PK) AS Number_of_Detection
    FROM
        pdp_results T
            JOIN
        commodity_code C ON T.COMMOD = C.`Commodity Code`
    WHERE
        MEAN NOT IN ('ND' , 'NP')
    GROUP BY C.`Commodity Name`
)
SELECT 
    d.`Commodity Name`,
    f.Total_Tests,
    d.Number_of_Detection,
    ROUND((d.Number_of_Detection / f.Total_Tests) * 100, 2) AS Percentage_Detected
FROM
    detect d
        JOIN
    (SELECT 
        C.`Commodity Name`, COUNT(SAMPLE_PK) AS Total_Tests
    FROM
        pdp_results T
            JOIN
        commodity_code C ON T.COMMOD = C.`Commodity Code`
    GROUP BY C.`Commodity Name`) AS f ON f.`Commodity Name` = d.`Commodity Name`
ORDER BY Percentage_Detected DESC;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
21 rows affected.


Commodity Name,Total_Tests,Number_of_Detection,Percentage_Detected
Pears,111647,3907,3.5
"Blueberries, Frozen",2836,80,2.82
Peaches,122963,2641,2.15
Sweet Bell Peppers,76917,1420,1.85
"Blueberries, Cultivated",140227,2439,1.74
Butter,36657,459,1.25
Grape Juice,123900,1552,1.25
Plums,55695,653,1.17
Tangerines,132416,1529,1.15
Celery,84231,871,1.03
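
The percentage calculation can be checked in pandas with a single grouped aggregation over a boolean "detected" flag. This is a sketch on made-up rows, with `X` as a placeholder detection code. Because the SQL version inner-joins the detections to the totals, commodities with zero detections drop out of its result; the last filter reproduces that.

```python
import pandas as pd

# Hypothetical joined results: commodity name plus the MEAN code of each test
toy = pd.DataFrame({
    "Commodity Name": ["Pears"] * 4 + ["Plums"] * 5 + ["Butter"] * 2,
    "MEAN": ["X", "ND", "ND", "ND",
             "X", "ND", "ND", "NP", "ND",
             "ND", "NP"],
})

# Flag tests that detected a residue
toy["detected"] = ~toy["MEAN"].isin(["ND", "NP"])

# Total tests and detections per commodity in one pass
summary = toy.groupby("Commodity Name").agg(
    Total_Tests=("detected", "size"),
    Number_of_Detection=("detected", "sum"),
)
summary["Percentage_Detected"] = (
    summary["Number_of_Detection"] / summary["Total_Tests"] * 100
).round(2)

# Drop commodities without detections (as the SQL inner join does), then sort
summary = summary[summary["Number_of_Detection"] > 0].sort_values(
    "Percentage_Detected", ascending=False
)
print(summary)
```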


In [165]:
# Save the query in database
sql = """ 
WITH detect AS (
    SELECT 
        C.`Commodity Name`,
        COUNT(SAMPLE_PK) AS Number_of_Detection
    FROM
        pdp_results T
            JOIN
        commodity_code C ON T.COMMOD = C.`Commodity Code`
    WHERE
        MEAN NOT IN ('ND' , 'NP')
    GROUP BY C.`Commodity Name`
)
SELECT 
    d.`Commodity Name`,
    f.Total_Tests,
    d.Number_of_Detection,
    ROUND((d.Number_of_Detection / f.Total_Tests) * 100, 2) AS Percentage_Detected
FROM
    detect d
        JOIN
    (SELECT 
        C.`Commodity Name`, COUNT(SAMPLE_PK) AS Total_Tests
    FROM
        pdp_results T
            JOIN
        commodity_code C ON T.COMMOD = C.`Commodity Code`
    GROUP BY C.`Commodity Name`) AS f ON f.`Commodity Name` = d.`Commodity Name`
ORDER BY Percentage_Detected DESC;

"""
store_query("Calculate percentage of the results that detected pesticide residue per commodity", sql)

In [166]:
%%sql
# Fetch the distinct labs from which test results were obtained
SELECT DISTINCT
    `Lab Agency Name` AS Lab_Name, `Lab City/State` AS Location
FROM
    pdp_results T
        JOIN
    lab_code L ON T.LAB = L.`Lab Code`;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
8 rows affected.


Lab_Name,Location
Washington State Department of Agriculture,"Yakima, WA"
New York Department of Agriculture and Markets,"Albany, NY"
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL"
"USDA, AMS, National Science Laboratory","Gastonia, NC"
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI"
California Department of Food & Agriculture,"Sacramento, CA"
Ohio Department of Agriculture,"Reynoldsburg, OH"
Texas Department of Agriculture,"College Station, TX"


In [167]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `Lab Agency Name` AS Lab_Name, `Lab City/State` AS Location
FROM
    pdp_results T
        JOIN
    lab_code L ON T.LAB = L.`Lab Code`;

"""
store_query("Fetch the distinct labs from which test results were obtained", sql)

In [168]:
%%sql
# Fetch the commodity results from each Lab with their location
SELECT DISTINCT
    `Lab Agency Name` AS lab_Name,
    `Lab City/State` AS Location,
    `Commodity Name` AS Commodity
FROM
    pdp_results
        LEFT JOIN
    commodity_code ON pdp_results.COMMOD = commodity_code.`Commodity Code`
        JOIN
    lab_code L ON pdp_results.LAB = L.`Lab Code`
ORDER BY lab_name; 



 * mysql+pymysql://root:***@localhost/pdp_2021_db
23 rows affected.


lab_Name,Location,Commodity
California Department of Food & Agriculture,"Sacramento, CA",Green Beans
California Department of Food & Agriculture,"Sacramento, CA",Summer Squash
California Department of Food & Agriculture,"Sacramento, CA",Winter Squash
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",Celery
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL","Peaches, Frozen"
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",Peaches
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",Sweet Bell Peppers
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",Carrots
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",Eggplant
New York Department of Agriculture and Markets,"Albany, NY",Broccoli


In [169]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `Lab Agency Name` AS lab_Name,
    `Lab City/State` AS Location,
    `Commodity Name` AS Commodity
FROM
    pdp_results
        LEFT JOIN
    commodity_code ON pdp_results.COMMOD = commodity_code.`Commodity Code`
        JOIN
    lab_code L ON pdp_results.LAB = L.`Lab Code`
ORDER BY lab_name; 


"""
store_query("Fetch the commodity results from each Lab with their location", sql)

In [170]:
%%sql
# Count of results from each lab from highest to lowest
SELECT DISTINCT
    `Lab Agency Name` AS lab_Name,
    `Lab City/State` AS Location,
    `Commodity Name` AS Commodity,
    COUNT(COMMOD) OVER (PARTITION BY `Lab Code`) AS Total_Per_Lab,
    COUNT(COMMOD) OVER (PARTITION BY COMMOD) AS Total_Per_Commodity
FROM
    pdp_results
        LEFT JOIN
    commodity_code ON pdp_results.COMMOD = commodity_code.`Commodity Code`
        JOIN
    lab_code L ON pdp_results.LAB = L.`Lab Code`
ORDER BY lab_name , Total_Per_Commodity DESC;




 * mysql+pymysql://root:***@localhost/pdp_2021_db
23 rows affected.


lab_Name,Location,Commodity,Total_Per_Lab,Total_Per_Commodity
California Department of Food & Agriculture,"Sacramento, CA",Winter Squash,906126,361593
California Department of Food & Agriculture,"Sacramento, CA",Green Beans,906126,357543
California Department of Food & Agriculture,"Sacramento, CA",Summer Squash,906126,264952
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",Peaches,320716,122963
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",Celery,320716,84231
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",Sweet Bell Peppers,320716,76917
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL","Peaches, Frozen",320716,36605
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",Carrots,341272,229821
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",Eggplant,341272,185710
New York Department of Agriculture and Markets,"Albany, NY",Eggplant,366520,185710
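
The two `COUNT(...) OVER (PARTITION BY ...)` columns attach a group-level count to every row without collapsing the rows, which corresponds to `groupby(...).transform("count")` in pandas. The sketch below uses made-up lab and commodity codes. Note that `Total_Per_Commodity` is partitioned by commodity only, so it counts results across all labs; this is why Eggplant shows the same total under two different labs in the output above.

```python
import pandas as pd

# Hypothetical stand-in for results_df (lab and commodity codes are made up)
toy = pd.DataFrame({
    "LAB": ["CA1", "CA1", "CA1", "NY1", "NY1"],
    "COMMOD": ["WS", "WS", "GB", "EP", "WS"],
})

# COUNT(COMMOD) OVER (PARTITION BY lab): per-row count of results for that lab
toy["Total_Per_Lab"] = toy.groupby("LAB")["COMMOD"].transform("count")

# COUNT(COMMOD) OVER (PARTITION BY COMMOD): count across all labs
toy["Total_Per_Commodity"] = toy.groupby("COMMOD")["COMMOD"].transform("count")

# SELECT DISTINCT ... ORDER BY lab, Total_Per_Commodity DESC
distinct = (
    toy.drop_duplicates()
    .sort_values(["LAB", "Total_Per_Commodity"], ascending=[True, False])
    .reset_index(drop=True)
)
print(distinct)
```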


In [171]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `Lab Agency Name` AS lab_Name,
    `Lab City/State` AS Location,
    `Commodity Name` AS Commodity,
    COUNT(COMMOD) OVER (PARTITION BY `Lab Code`) AS Total_Per_Lab,
    COUNT(COMMOD) OVER (PARTITION BY COMMOD) AS Total_Per_Commodity
FROM
    pdp_results
        LEFT JOIN
    commodity_code ON pdp_results.COMMOD = commodity_code.`Commodity Code`
        JOIN
    lab_code L ON pdp_results.LAB = L.`Lab Code`
ORDER BY lab_name , Total_Per_Commodity DESC;



"""
store_query("Count of results from each lab from highest to lowest", sql)

In [172]:
%%sql
# Fetch the commodity type of each commodity with test results
SELECT 
    `Commodity Name`,
    COMMTYPE AS Commodity_Type,
    COUNT(`Commodity Name`) AS Number_of_Test
FROM
    pdp_results p
        LEFT JOIN
    commodity_code c ON p.COMMOD = c.`Commodity Code`
GROUP BY 1 , 2
ORDER BY 2 , 3 DESC
;




 * mysql+pymysql://root:***@localhost/pdp_2021_db
21 rows affected.


Commodity Name,Commodity_Type,Number_of_Test
Winter Squash,FR,361593
Green Beans,FR,357543
Summer Squash,FR,264952
Carrots,FR,229821
Eggplant,FR,185710
Broccoli,FR,144052
"Blueberries, Cultivated",FR,140227
Tangerines,FR,132416
Peaches,FR,122963
Pears,FR,111647


In [173]:
# Save the query in database
sql = """ 
SELECT 
    `Commodity Name`,
    COMMTYPE AS Commodity_Type,
    COUNT(`Commodity Name`) AS Number_of_Test
FROM
    pdp_results p
        LEFT JOIN
    commodity_code c ON p.COMMOD = c.`Commodity Code`
GROUP BY 1 , 2
ORDER BY 2 , 3 DESC
;



"""
store_query("Fetch the commodity type of each commodity with test results", sql)

In [174]:
%%sql
# Fetch the most common confirmation methods used by each lab
SELECT DISTINCT
    `Lab Agency Name` AS lab_Name,
    `Lab City/State` AS Location,
    `Confirmation Method`,
    COUNT(`Confirmation Method`) AS Amount_of_tests
FROM
    pdp_results p
        LEFT JOIN
    commodity_code ON p.COMMOD = commodity_code.`Commodity Code`
        LEFT JOIN
    lab_code L ON p.LAB = L.`Lab Code`
        LEFT JOIN
    confirmation_method_code m ON p.CONFMETHOD = m.`ConfMethod Code`
WHERE
    CONFMETHOD IS NOT NULL
        AND CONFMETHOD2 IS NOT NULL
GROUP BY 1 , 2 , 3
ORDER BY 1 , 4 DESC;




 * mysql+pymysql://root:***@localhost/pdp_2021_db
25 rows affected.


lab_Name,Location,Confirmation Method,Amount_of_tests
California Department of Food & Agriculture,"Sacramento, CA",LC-MS/MS - triple quadrapole,3878
California Department of Food & Agriculture,"Sacramento, CA",GC/MS/MS - triple quadropole,1561
California Department of Food & Agriculture,"Sacramento, CA",,0
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",LC-MS/MS - triple quadrapole,3151
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",GC/MS/MS - triple quadropole,1877
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",,0
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",LC-MS/MS - triple quadrapole,890
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",GC/MS/MS - triple quadropole,324
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",,0
New York Department of Agriculture and Markets,"Albany, NY",GC/MS/MS - triple quadropole,1297


In [175]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `Lab Agency Name` AS lab_Name,
    `Lab City/State` AS Location,
    `Confirmation Method`,
    COUNT(`Confirmation Method`) AS Amount_of_tests
FROM
    pdp_results p
        LEFT JOIN
    commodity_code ON p.COMMOD = commodity_code.`Commodity Code`
        LEFT JOIN
    lab_code L ON p.LAB = L.`Lab Code`
        LEFT JOIN
    confirmation_method_code m ON p.CONFMETHOD = m.`ConfMethod Code`
WHERE
    CONFMETHOD IS NOT NULL
        AND CONFMETHOD2 IS NOT NULL
GROUP BY 1 , 2 , 3
ORDER BY 1 , 4 DESC;



"""
store_query("Fetch the most common confirmation methods used by each lab", sql)

In [176]:
%%sql
# Fetch the most common determinative methods used by each lab
SELECT DISTINCT
    `Lab Agency Name` AS lab_Name,
    `Lab City/State` AS Location,
    `Determinative Method`,
    COUNT(`Determinative Method`) AS Amount_of_tests
FROM
    pdp_results p
        LEFT JOIN
    commodity_code ON p.COMMOD = commodity_code.`Commodity Code`
        LEFT JOIN
    lab_code L ON p.LAB = L.`Lab Code`
        LEFT JOIN
    determinative_method_code d ON p.DETERMIN = d.`Determin Code`
WHERE
    DETERMIN IS NOT NULL
GROUP BY 1 , 2 , 3
ORDER BY 1 , 4 DESC;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
20 rows affected.


lab_Name,Location,Determinative Method,Amount_of_tests
California Department of Food & Agriculture,"Sacramento, CA",LC/MS/MS - Liquid Chrom w/ Tandem Mass Spec - triple quad,625490
California Department of Food & Agriculture,"Sacramento, CA",GC/MS/MS - triple quadrupole,280636
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",LC/MS/MS - Liquid Chrom w/ Tandem Mass Spec - triple quad,165026
Florida Dept of Agriculture & Consumer Services,"Tallahassee, FL",GC/MS/MS - triple quadrupole,155690
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",GC/MS/MS - triple quadrupole,124630
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",Second LC/MS/MS,109480
Michigan Dept of Agriculture & Rural Development,"East Lansing, MI",LC/MS/MS - Liquid Chrom w/ Tandem Mass Spec - triple quad,107162
New York Department of Agriculture and Markets,"Albany, NY",GC/MS/MS - triple quadrupole,166598
New York Department of Agriculture and Markets,"Albany, NY",LC/MS/MS - Liquid Chrom w/ Tandem Mass Spec - triple quad,100686
New York Department of Agriculture and Markets,"Albany, NY",LC/HRMS - Liquid Chrom w/High Resolution Mass Spec,99236


In [177]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    `Lab Agency Name` AS lab_Name,
    `Lab City/State` AS Location,
    `Determinative Method`,
    COUNT(`Determinative Method`) AS Amount_of_tests
FROM
    pdp_results p
        LEFT JOIN
    commodity_code ON p.COMMOD = commodity_code.`Commodity Code`
        LEFT JOIN
    lab_code L ON p.LAB = L.`Lab Code`
        LEFT JOIN
    determinative_method_code d ON p.DETERMIN = d.`Determin Code`
WHERE
    DETERMIN IS NOT NULL
GROUP BY 1 , 2 , 3
ORDER BY 1 , 4 DESC;
"""
store_query("Fetch the most common determinative methods used by each lab", sql)

In [178]:
%%sql
# Fetch the origin of each test sample
SELECT DISTINCT
    C.`Commodity Name`,
    `Origin of Sample`,
    COUNT(`Origin of Sample`) AS Number_of_Samples
FROM
    pdp_samples R
        LEFT JOIN
    commodity_code C ON R.COMMOD = C.`Commodity Code`
        LEFT JOIN
    origin_code o ON o.`Origin Code` = R.ORIGIN
GROUP BY 1 , 2
;


 * mysql+pymysql://root:***@localhost/pdp_2021_db
51 rows affected.


Commodity Name,Origin of Sample,Number_of_Samples
"Blueberries, Cultivated",Imported,431
"Blueberries, Cultivated",Domestic (U.S.),261
Broccoli,Domestic (U.S.),581
Broccoli,Imported,126
Broccoli,Unknown origin,1
Butter,Domestic (U.S.),170
Butter,Imported,7
"Blueberries, Frozen",Imported,11
"Blueberries, Frozen",Domestic (U.S.),3
Celery,Domestic (U.S.),338


In [179]:
# Save the query in database
sql = """ 
SELECT DISTINCT
    C.`Commodity Name`,
    `Origin of Sample`,
    COUNT(`Origin of Sample`) AS Number_of_Samples
FROM
    pdp_samples R
        LEFT JOIN
    commodity_code C ON R.COMMOD = C.`Commodity Code`
        LEFT JOIN
    origin_code o ON o.`Origin Code` = R.ORIGIN
GROUP BY 1 , 2
;
"""
store_query("Fetch the origin of each test sample", sql)

In [180]:
%%sql
# Fetch samples without COUNTRY
SELECT 
    *
FROM
    pdp_samples
WHERE
    COUNTRY IN ('');


 * mysql+pymysql://root:***@localhost/pdp_2021_db
6918 rows affected.


SAMPLE_PK,STATE,YEAR,MONTH,DAY,SITE,COMMOD,SOURCE_ID,VARIETY,ORIGIN,COUNTRY,DISTTYPE,COMMTYPE,CLAIM,QUANTITY,GROWST,PACKST,DISTST
3,CA,21,3,8,149,BB,,Na,1,,D,FR,PO,,CA,,CA
4,CA,21,3,8,230,BB,,,1,,T,FR,PO,,,,CA
22,CA,21,4,12,642,BB,,Na,1,,R,FR,PO,,,,CA
26,CA,21,5,10,43,BB,,Unknown,1,,T,FR,NC,,,,CA
27,CA,21,5,10,49,BB,,Unknown,1,,T,FR,NC,,,,CA
30,CA,21,5,10,199,BB,P,,1,,R,FR,NC,,,,CA
32,CA,21,5,10,490,BB,,Na,1,,R,FR,NC,,,,CA
38,CA,21,5,10,626,BB,B,Blueberries Snow Cha,1,,R,FR,NC,,,,CA
39,CA,21,6,14,10,BB,,Unknown,1,,H,FR,NC,,,,OR
40,CA,21,6,14,43,BB,,,1,,T,FR,NC,,,,CA
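
A pandas control for this filter needs to allow for how the blanks are represented: `pd.read_csv` turns empty fields into `NaN` by default, while data read back from MySQL may carry empty strings. A small sketch that treats both (and whitespace-only values) as missing; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for sample_df with the different kinds of blank COUNTRY
toy = pd.DataFrame({
    "SAMPLE_PK": [1, 2, 3, 4],
    "COUNTRY": ["595", "", None, " "],
})

# Treat NaN/None, empty strings and whitespace-only strings as missing
blank_country = toy["COUNTRY"].fillna("").str.strip().eq("")

samples_without_country = toy[blank_country]
print(samples_without_country)
```

The row count of `samples_without_country` can then be compared with the "rows affected" figure reported by the SQL cell.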


In [181]:
# Save the query in database
sql = """ 
SELECT 
    *
FROM
    pdp_samples
WHERE
    COUNTRY IN ('');
"""
store_query("Fetch samples without COUNTRY", sql)