# Purpose: 
### This notebook compares the three Excel files of RSS data we were given. Each file is referred to below by a short name:
- RSS Master Data Collection All Jan 25.xlsx --> All 
- Copy of RSS Master Data File 2.xlsx --> Copy 
- RSS Master Data File Jan25.xlsx --> File

### A summary of findings can also be found here: 
https://docs.google.com/document/d/1ctBDy4UXE1FVN9GbdY8gVliDDVLOYHuuZt4t69LCY34/edit?usp=sharing


In [None]:
import warnings
import numpy as np
import pandas as pd
from collections import defaultdict

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 200)

In [None]:
# Use this to read excel files 
!pip install openpyxl

Collecting openpyxl
  Using cached openpyxl-3.0.6-py2.py3-none-any.whl (242 kB)
Collecting jdcal
  Downloading jdcal-1.4.1-py2.py3-none-any.whl (9.5 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.0.1.tar.gz (8.4 kB)
Building wheels for collected packages: et-xmlfile
  Building wheel for et-xmlfile (setup.py) ... done
  Created wheel for et-xmlfile: filename=et_xmlfile-1.0.1-py3-none-any.whl size=8913 sha256=b49a1c67d6f62f8ed5a46efebc3039be52586c92f400c89635522f3a6da335ab
  Stored in directory: /root/.cache/pip/wheels/e2/bd/55/048b4fd505716c4c298f42ee02dffd9496bb6d212b266c7f31
Successfully built et-xmlfile
Installing collected packages: jdcal, et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.0.1 jdcal-1.4.1 openpyxl-3.0.6


# Load the Excel files 

In [None]:
df_all = pd.read_excel('data/RSS Master Data Collection All Jan 25.xlsx',engine='openpyxl', parse_dates=True)

In [None]:
df_copy = pd.read_excel('data/Copy of RSS Master Data File 2.xlsx',engine='openpyxl', parse_dates=True)

In [None]:
df_file = pd.read_excel('data/RSS Master Data File Jan25.xlsx',engine='openpyxl', parse_dates=True)

# Comparing Columns in the 3 excel files 

# Standardize Column Names 

In [None]:
# Rename columns in All to match the naming conventions in Copy
# (several names differ only by trailing whitespace)

df_all.rename(columns={'Vehicle Type  ': 'Vehicle Type',
                       'Inside/Curb': 'I or C?',
                       'Meandor ': 'Meandor',
                       'Time(Sec)': 'Time',
                       'Neighborhood ': 'Neighborhood',
                       'Key Code?': 'Locked',
                       'Truck # ': 'Truck #',
                       'Toter (unit) ': '#Units',
                       'Steep/Flat': 'Hill or Flat?',
                       'Street Sweeping ': 'Street Sweeping'}, inplace=True)
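
Most of the renames above only remove stray whitespace. A minimal sketch of doing that generically is shown here. The helper `normalize_columns` and the `ALIASES` dictionary are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
# Hypothetical helper: strip stray whitespace from column names, then apply
# explicit aliases for columns that are genuinely named differently.
ALIASES = {
    'Inside/Curb': 'I or C?',
    'Time(Sec)': 'Time',
    'Key Code?': 'Locked',
    'Toter (unit)': '#Units',
    'Steep/Flat': 'Hill or Flat?',
}

def normalize_columns(names, aliases=ALIASES):
    """Return cleaned column names in the same order as `names`."""
    cleaned = [str(name).strip() for name in names]
    return [aliases.get(name, name) for name in cleaned]

raw = ['Vehicle Type  ', 'Inside/Curb', 'Meandor ', 'Toter (unit) ', 'Route']
print(normalize_columns(raw))
```

With a DataFrame, this could be applied as `df_all.columns = normalize_columns(df_all.columns)`, which avoids listing every whitespace-only rename by hand.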

In [None]:
df_all.columns

Index(['Day', 'Route', 'Truck #', 'Vehicle Type', 'Hill or Flat?',
       'Street Sweeping', 'Time', '#Units', 'Toter Size', 'Total Volume',
       'Commodity', 'Tipper', 'Neighborhood', 'I or C?', 'Address #', 'Apt.#',
       'Street', 'Meandor', 'Locked', 'Type'],
      dtype='object')

In [None]:
df_copy.columns

Index(['Date', 'Day', 'Unnamed: 2', 'Route', 'Truck #', 'Vehicle Type',
       'Commodity', 'Tipper', 'Sequence #', 'Address #', 'Apt.#', 'Street',
       'Even/Odd', 'Meandor', 'I or C?', 'Time', 'Block Time', '#Units',
       'Number of Stops', '16 gal', '20 gal', '32 gal', '64 gal', '96 gal',
       'CCAN', 'Cardboard Box', 'Trash Bags', 'Neighborhood', 'Hill or Flat?',
       'Street Sweeping', 'Locked', 'Common Notes', 'Additional Notes',
       'GlobalID', 'x', 'y', 'Data Collector'],
      dtype='object')

In [None]:
all_cols = set(df_all.columns)
copy_cols = set(df_copy.columns)
file_cols = set(df_file.columns)

## First, determine the differences between copy and file 
- File has 41 columns, Copy has 37 columns 
    - File has 7 extra columns specifying the yd bin sizes (1yd, 1.5yd, 2yd, 3yd, 4yd,5yd,6yd)
        - All of these columns are filled with 0s, so these columns are meaningless
    - Copy has 3 extra columns: Data Collector, Neighborhood, Unnamed: 2
        - Data Collector has only one data collector - Norma 
        - Neighborhood only has one neighborhood mentioned - Excelsior - and it is listed 91 times. The rest are null values 
        - Unnamed: 2 only has nan values and should be deleted
- File has 1256 rows, Copy has 5171 rows
    - Refer to Shruti's analysis below to understand why there are so many more rows in Copy than in File 
### Conclusion: Column-wise, the only meaningful difference between the two is that Copy contains a Neighborhood column, and that column only covers a single neighborhood. The other extra columns are empty or hold a single value. The reason Copy has so many more rows is explored below. 
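
The checks behind these bullets (all zeros, all NaN, a single collector name) can be run in one pass. This is a rough sketch on a toy frame, assuming a hypothetical helper `uninformative_columns`; the toy values are illustrative and are not taken from the real files.

```python
import pandas as pd

def uninformative_columns(df):
    """Return the columns that hold at most one distinct non-null value."""
    distinct = df.nunique(dropna=True)
    return sorted(distinct[distinct <= 1].index)

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'Route': [912, 918, 1],
    '1 yd': [0, 0, 0],
    'Unnamed: 2': [float('nan')] * 3,
    'Data Collector': ['Norma', None, 'Norma'],
})
print(uninformative_columns(toy))
```

Calling it on `df_file` and `df_copy` would list the constant and empty columns without inspecting each `value_counts()` by hand.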
    

In [None]:
# How many columns are in each? 
len(df_copy.columns) # 37 columns
len(df_file.columns) # 41 columns

41

In [None]:
# Which columns are in copy that are not in file? 
copy_cols.difference(file_cols)

# Which columns are in file that are not in copy?
file_cols.difference(copy_cols)

# What are the overall differences? 
file_cols.symmetric_difference(copy_cols) 

{'1 yd',
 '1.5 yd',
 '2 yd',
 '3 yd',
 '4 yd ',
 '5 yd ',
 '6 yd ',
 'Data Collector',
 'Neighborhood',
 'Unnamed: 2'}

In [None]:
# The 'Unnamed: 2' column is empty --> every value is NaN, so value_counts() returns nothing
df_copy['Unnamed: 2'].value_counts()

Series([], Name: Unnamed: 2, dtype: int64)

In [None]:
# We have Neighborhood information for one neighborhood - Excelsior - and it appears 91 times. 
df_copy['Neighborhood'].value_counts()

Excelsior    91
Name: Neighborhood, dtype: int64

In [None]:
# The Data Collector column has only 1 data collector name - Norma
df_copy['Data Collector'].value_counts()

Norma    3915
Name: Data Collector, dtype: int64

In [None]:
# Even though File has extra columns with the yd sizes, they are all filled with zeros 

# df_file['1 yd'].value_counts()   # All zeros
# df_file['1.5 yd'].value_counts() # All zeros
# df_file['2 yd'].value_counts()   # All zeros
# df_file['3 yd'].value_counts()   # All zeros
# df_file['4 yd '].value_counts()  # All zeros
# df_file['5 yd '].value_counts()  # All zeros
df_file['6 yd '].value_counts()    # All zeros

0    1256
Name: 6 yd , dtype: int64

In [None]:
len(df_file) # 1256 rows
len(df_copy) # 5171 rows 

5171

## Second, compare the differences between All and Copy 
- All has 3729 rows, Copy has 5171 rows 
- Though All has fewer columns than Copy (20 vs. 37), it has three columns that appear in neither File nor Copy:
    - 'Total Volume'
    - 'Toter Size' --> Here the bins are written as a list such as 16,(3)32
    - 'Type' --> specifies R or C, but 97% are missing values 
- 'Street Sweeping ' and 'Toter (unit) ' in All are not new columns: they correspond to 'Street Sweeping' and '#Units' in Copy and were renamed above
### Conclusion: We need to join the data from All with either Copy or File to create our master spreadsheet. We will also have to think carefully about how to combine duplicate columns (e.g. Steep/Flat and Hill or Flat?) 
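
To join All with Copy, the free-text 'Toter Size' values would need to be turned into per-size counts comparable to Copy's '16 gal' ... '96 gal' columns. The sketch below assumes that `(n)size` means n bins of that size, which is consistent with the rows where `16,(5)32` has a Total Volume of 176. `parse_toter_size` and `total_volume` are hypothetical helpers, not part of the notebook.

```python
import re
from collections import Counter

GALLON_SIZES = (16, 20, 32, 64, 96)
_TOKEN = re.compile(r'^(?:\((\d+)\))?(\d+)$')

def parse_toter_size(value):
    """Turn '16,(5)32' into per-size counts; raise ValueError if unparseable."""
    # Drop free-text suffixes such as ' + blade' or ' + 4 bags'.
    text = str(value).split('+')[0]
    counts = Counter()
    for token in text.split(','):
        match = _TOKEN.match(token.strip())
        if not match or int(match.group(2)) not in GALLON_SIZES:
            raise ValueError(f'cannot parse toter size: {value!r}')
        counts[int(match.group(2))] += int(match.group(1) or 1)
    return counts

def total_volume(counts):
    """Total gallons implied by the parsed counts."""
    return sum(size * n for size, n in counts.items())

print(parse_toter_size('16,(5)32'), total_volume(parse_toter_size('16,(5)32')))
```

Values that do not fit the pattern (for example a run of digits with no comma) raise an error instead of being guessed at, so they can be reviewed by hand.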

In [None]:
# How many columns are in each? 
len(df_copy.columns) # 37 columns
len(df_all.columns) # 20 columns

20

In [None]:
# Which columns are in copy that are not in all? 
copy_cols.difference(all_cols)

{'16 gal',
 '20 gal',
 '32 gal',
 '64 gal',
 '96 gal',
 'Additional Notes',
 'Block Time',
 'CCAN',
 'Cardboard Box',
 'Common Notes',
 'Data Collector',
 'Date',
 'Even/Odd',
 'GlobalID',
 'Number of Stops',
 'Sequence #',
 'Trash Bags',
 'Unnamed: 2',
 'x',
 'y'}

In [None]:
# Which columns are in all that are not in copy?
all_cols.difference(copy_cols)

{'Total Volume', 'Toter Size', 'Type'}

In [None]:
# What are the overall differences? 
# --> Because the column names were standardized above, columns that differ only by spacing are no longer flagged as different
all_cols.symmetric_difference(copy_cols)

{'16 gal',
 '20 gal',
 '32 gal',
 '64 gal',
 '96 gal',
 'Additional Notes',
 'Block Time',
 'CCAN',
 'Cardboard Box',
 'Common Notes',
 'Data Collector',
 'Date',
 'Even/Odd',
 'GlobalID',
 'Number of Stops',
 'Sequence #',
 'Total Volume',
 'Toter Size',
 'Trash Bags',
 'Type',
 'Unnamed: 2',
 'x',
 'y'}

In [None]:
# How does df_copy['Hill or Flat?'] differ from df_all['Hill or Flat?'] (previously df_all['Steep/Flat'])?
print(df_all['Hill or Flat?'].value_counts()) # --> Has much more missing data 
print(df_copy['Hill or Flat?'].value_counts())


na    1630
F     1185
S      624
Name: Hill or Flat?, dtype: int64
Flat    2220
Hill     277
Name: Hill or Flat?, dtype: int64
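
The two files encode this column differently (F / S / na in All, Flat / Hill in Copy). A minimal sketch of harmonizing them follows. It assumes that S (Steep) corresponds to Hill and that 'na' means missing; both are assumptions, and `harmonize_hill_flat` is a hypothetical helper.

```python
# Assumed mapping: 'S' (Steep) in All corresponds to 'Hill' in Copy.
HILL_FLAT_CODES = {'F': 'Flat', 'S': 'Hill', 'Flat': 'Flat', 'Hill': 'Hill'}

def harmonize_hill_flat(value):
    """Map a raw code to 'Flat' or 'Hill'; return None for missing or unknown values."""
    if value is None:
        return None
    return HILL_FLAT_CODES.get(str(value).strip())

print([harmonize_hill_flat(v) for v in ['F', 'S', 'na', 'Flat', 'Hill']])
```

It could be applied with `df_all['Hill or Flat?'].map(harmonize_hill_flat)` before the two columns are combined.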


In [None]:
print(len(df_all)) # 3729 rows
print(len(df_copy)) # 5171 rows 

3729
5171


In [None]:
# What are the unique columns in all_cols?
all_cols.difference(copy_cols, file_cols)

{'Total Volume', 'Toter Size', 'Type'}

In [None]:
# What are the unique columns in file_cols?
file_cols.difference(copy_cols, all_cols)

{'1 yd', '1.5 yd', '2 yd', '3 yd', '4 yd ', '5 yd ', '6 yd '}

In [None]:
# What are the unique columns in copy_cols?
copy_cols.difference(all_cols, file_cols)

{'Data Collector', 'Unnamed: 2'}

# Comparing Rows in the 3 excel files

## Differences in the rows of the file:

#### 1. 856 duplicate rows in All
#### 2. 492 duplicate rows in Copy
#### 3. 0 duplicate rows in File

#### Comparing the rows of Copy and File (All does not share enough columns for this comparison), we find that
#### 3683 of Copy's rows have no matching row in File (Copy has 5171 rows, 4679 after dropping duplicates)
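
One way to count unmatched rows is a left merge with pandas' `indicator=True`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below uses toy frames and a hypothetical helper `unmatched_rows`; it is not the exact comparison run later in this notebook.

```python
import pandas as pd

def unmatched_rows(left, right, keys):
    """Return the rows of `left` that have no match in `right` on `keys`."""
    # De-duplicate the right-hand keys so the merge cannot multiply rows.
    right_keys = right[keys].drop_duplicates()
    merged = left.merge(right_keys, on=keys, how='left', indicator=True)
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Toy frames standing in for Copy and File (values are illustrative only).
copy_toy = pd.DataFrame({'Route': [912, 912, 68], 'Time': [118, 59, 42]})
file_toy = pd.DataFrame({'Route': [912, 912], 'Time': [118, 59]})
missing = unmatched_rows(copy_toy, file_toy, ['Route', 'Time'])
print(len(missing))
```

Compared with testing a right-hand column for nulls after the merge, the indicator column still works when that column legitimately contains nulls.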


In [None]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3729 entries, 0 to 3728
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Day               3581 non-null   object 
 1   Route             3729 non-null   int64  
 2   Truck #           3569 non-null   object 
 3   Vehicle Type      3514 non-null   object 
 4   Steep/Flat        3439 non-null   object 
 5   Street Sweeping   3639 non-null   object 
 6   Time(Sec)         3729 non-null   int64  
 7   Toter (unit)      3729 non-null   int64  
 8   Toter Size        3729 non-null   object 
 9   Total Volume      3685 non-null   float64
 10  Commodity         3729 non-null   object 
 11  Tipper            3729 non-null   object 
 12  Neighborhood      91 non-null     object 
 13  Inside/Curb       3729 non-null   object 
 14  Address #         3441 non-null   object 
 15  Apt.#             3 non-null      object 
 16  Street            307 non-null    object 


In [None]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 37 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              1303 non-null   object 
 1   Day               4806 non-null   float64
 2   Unnamed: 2        0 non-null      float64
 3   Route             5032 non-null   float64
 4   Truck #           3751 non-null   float64
 5   Vehicle Type      2299 non-null   object 
 6   Commodity         5171 non-null   object 
 7   Tipper            4381 non-null   float64
 8   Sequence #        2211 non-null   float64
 9   Address #         1668 non-null   object 
 10  Apt.#             6 non-null      object 
 11  Street            1746 non-null   object 
 12  Even/Odd          12 non-null     object 
 13  Meandor           3871 non-null   object 
 14  I or C?           1303 non-null   object 
 15  Time              5158 non-null   object 
 16  Block Time        15 non-null     float64


In [None]:
df_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1256 entries, 0 to 1255
Data columns (total 41 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              1256 non-null   object 
 1   Day               1256 non-null   int64  
 2   Route             1256 non-null   int64  
 3   Truck #           1256 non-null   int64  
 4   Vehicle Type      1256 non-null   object 
 5   Commodity         1256 non-null   object 
 6   Tipper            1256 non-null   int64  
 7   Sequence #        1256 non-null   int64  
 8   Address #         1225 non-null   object 
 9   Apt.#             3 non-null      object 
 10  Street            1256 non-null   object 
 11  Even/Odd          12 non-null     object 
 12  Meandor           3 non-null      object 
 13  I or C?           1256 non-null   object 
 14  Time              1243 non-null   object 
 15  Block Time        15 non-null     float64
 16  #Units            1240 non-null   float64


In [None]:
df_all_shruti=df_all.drop_duplicates()
df_copy_shruti=df_copy.drop_duplicates()
df_file_shruti=df_file.drop_duplicates()

In [None]:
print(len(df_all)-len(df_all_shruti),"number of duplicate rows in All")
print(len(df_copy)-len(df_copy_shruti),"number of duplicate rows in copy")
print(len(df_file)-len(df_file_shruti),"number of duplicate rows in file")

516 number of duplicate rows in All
284 number of duplicate rows in copy
0 number of duplicate rows in file


## Looking at the granularity of the data

In [None]:
df_all_shruti.columns

Index(['Day', 'Route', 'Truck # ', 'Vehicle Type  ', 'Steep/Flat',
       'Street Sweeping ', 'Time(Sec)', 'Toter (unit) ', 'Toter Size',
       'Total Volume', 'Commodity', 'Tipper', 'Neighborhood ', 'Inside/Curb',
       'Address #', 'Apt.#', 'Street', 'Meandor ', 'Key Code?', 'Type'],
      dtype='object')

In [None]:
#df_all_shruti.columns
df_all_grouped=df_all_shruti.groupby(by=['Day', 'Route','Truck # ','Street','Address #']).size().reset_index(name='count')
df_all_grouped[df_all_grouped['count']>1].sort_values(by=['count'])

Unnamed: 0,Day,Route,Truck #,Street,Address #,count
38,2,68,14562,San Bruno,2600,2
44,2,68,14562,San Bruno,2900,2
62,4,59,14559,Parnassus,101,2
35,2,68,14562,San Bruno,2400,3
46,2,68,14562,Wayland,140,4
36,2,68,14562,San Bruno,2500,5
31,2,68,14562,Dwight,0,6
33,2,68,14562,Olmstead,0,6
41,2,68,14562,San Bruno,2900,8
48,2,68,14562,Woosley,100,8


In [None]:
df_all_shruti.loc[(df_all_shruti['Day']==2) & (df_all_shruti['Street']=='San Bruno') & (df_all_shruti['Address #']==2700)]['Toter Size'].explode().values

array([32, '(6)32', '(3)32', '(3)32', '(2)32,96', '20,32,(2)64',
       '(3)32,(2)64 + blade', '(2)32,(2)64', '(2)32,64', 32, 32, 32,
       '(2)32', 96, 96, '32,64'], dtype=object)

In [None]:
df_copy_shruti.columns

Index(['Date', 'Day', 'Unnamed: 2', 'Route', 'Truck #', 'Vehicle Type',
       'Commodity', 'Tipper', 'Sequence #', 'Address #', 'Apt.#', 'Street',
       'Even/Odd', 'Meandor', 'I or C?', 'Time', 'Block Time', '#Units',
       'Number of Stops', '16 gal', '20 gal', '32 gal', '64 gal', '96 gal',
       'CCAN', 'Cardboard Box', 'Trash Bags', 'Neighborhood', 'Hill or Flat?',
       'Street Sweeping', 'Locked', 'Common Notes', 'Additional Notes',
       'GlobalID', 'x', 'y', 'Data Collector'],
      dtype='object')

In [None]:
df_copy_shruti.groupby(by=['Day', 'Route','Truck #','Address #','Sequence #']).size().reset_index(name='count')
#df_all_grouped[df_all_grouped['count']>1].sort_values(by=['count'])

Unnamed: 0,Day,Route,Truck #,Address #,Sequence #,count
0,1.0,1.0,14393.0,2,40.0,1
1,1.0,1.0,14393.0,2,121.0,1
2,1.0,1.0,14393.0,2,167.0,1
3,1.0,1.0,14393.0,9,150.0,1
4,1.0,1.0,14393.0,25,47.0,1
...,...,...,...,...,...,...
1260,5.0,912.0,14611.0,5530/5540,88.0,1
1261,5.0,912.0,14611.0,5748/5746,102.0,1
1262,5.0,912.0,14611.0,5931/5937,97.0,1
1263,5.0,912.0,14611.0,5951/5939,95.0,1


In [None]:
df_copy_shruti.loc[(df_copy_shruti['Day']==2) & (df_copy_shruti['Street']=='San Bruno') & (df_copy_shruti['Address #']==2700)]

Unnamed: 0,Date,Day,Unnamed: 2,Route,Truck #,Vehicle Type,Commodity,Tipper,Sequence #,Address #,Apt.#,Street,Even/Odd,Meandor,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,Cardboard Box,Trash Bags,Neighborhood,Hill or Flat?,Street Sweeping,Locked,Common Notes,Additional Notes,GlobalID,x,y,Data Collector
2203,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,42,,1.0,0.0,0.0,0.0,0.0,1.0,,0,0,,,Flat,,,,,,,,Norma
2204,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,35,,1.0,0.0,0.0,0.0,0.0,1.0,,0,0,,,Flat,,,,,,,,Norma
2206,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,28,,1.0,0.0,0.0,1.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
2210,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,26,,1.0,0.0,0.0,1.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
2211,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,24,,1.0,0.0,0.0,1.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
2212,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,34,,1.0,0.0,0.0,1.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
3570,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,36,,2.0,0.0,0.0,2.0,0.0,0.0,,0,0,,,Flat,,,,,,,,Norma
3571,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,42,,2.0,0.0,0.0,1.0,1.0,0.0,,0,0,,,Flat,,,,,,,,Norma
4083,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,66,,3.0,0.0,0.0,3.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
4084,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2700,,San Bruno,,C,,65,,3.0,0.0,0.0,3.0,0.0,0.0,,0,0,,,Flat,Y,,,,,,,Norma


In [None]:
# Chain drop_duplicates() instead of calling it in place on a slice of df_copy_shruti
df_subset = df_copy_shruti[['Day','Street','Address #']].drop_duplicates()
df_subset[df_subset['Street']=='San Bruno']

Unnamed: 0,Day,Street,Address #
2203,2.0,San Bruno,2700
2205,2.0,San Bruno,2900
2207,2.0,San Bruno,2400
2209,2.0,San Bruno,2600
2216,2.0,San Bruno,3000
2218,2.0,San Bruno,2910
3573,2.0,San Bruno,2500
3594,2.0,San Bruno,2845
4092,2.0,San Bruno,2574


In [None]:
# Something looks off here. To check it, pair each stop's 'Toter Size' entries in All
# with the per-size bin counts recorded for the same stop in Copy.
df_subset = df_copy_shruti[['Day','Route','Street','Address #']].drop_duplicates()
gal_cols = ['16 gal', '20 gal', '32 gal', '64 gal', '96 gal']

# Maps 'Day_Street_Address_Route' -> (Toter Size values from All, bin counts from Copy)
Compare = {}

for day, street, address, route in zip(df_subset['Day'], df_subset['Street'],
                                       df_subset['Address #'], df_subset['Route']):
    key = f'{day}_{street}_{address}_{route}'
    in_all = ((df_all_shruti['Day'] == day) & (df_all_shruti['Street'] == street)
              & (df_all_shruti['Address #'] == address) & (df_all_shruti['Route'] == route))
    in_copy = ((df_copy_shruti['Day'] == day) & (df_copy_shruti['Street'] == street)
               & (df_copy_shruti['Address #'] == address) & (df_copy_shruti['Route'] == route))
    Compare[key] = (df_all_shruti.loc[in_all, 'Toter Size'].explode(),
                    df_copy_shruti.loc[in_copy, gal_cols])



In [None]:
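import csv

# Write one row per (key, Toter Size from All, bin counts from Copy) pair
with open('output.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    for key, (toter_sizes, bin_counts) in Compare.items():
        # Skip keys that have no matching rows in All
        if not toter_sizes.empty:
            for size, counts in zip(toter_sizes.values, bin_counts.values):
                csvwriter.writerow([key, str(size), str(counts)])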

In [None]:
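def comparision_rows(df1,df2):
    """Merge df1 with df2 on every column they share.

    Returns the shared columns, whether df1 has duplicates on those columns,
    the inner merge, and the left merge.
    """
    merge_list = list(set(df1.columns).intersection(df2.columns))

    # Work on a copy so the caller's df2 is not modified
    df2 = df2.copy()
    # Cast the shared columns of df2 to df1's dtypes so equal values compare equal
    for x in merge_list:
        df2[x]=df2[x].astype(df1[x].dtype,errors='ignore')

    df_merged = df1.merge(df2,on=merge_list,how='inner')
    df_left_merged = df1.merge(df2,on=merge_list,how='left')

    # Check whether df1 has duplicate rows on the combination of shared columns
    check=df1.duplicated(subset=merge_list).any()

    return merge_list,check,df_merged,df_left_merged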



In [None]:
list_1 = set(df_copy_shruti.columns)
list_2 = set(df_all_shruti.columns)
merge_list = (list(list_1.intersection(list_2)))
print(merge_list)

['Meandor', 'I or C?', 'Address #', 'Time', 'Day', 'Vehicle Type', 'Route', 'Street', 'Tipper', 'Commodity', 'Apt.#']


0       68.0
1       68.0
2       68.0
3       68.0
4       68.0
        ... 
3724    37.0
3725    37.0
3726    37.0
3727    37.0
3728    37.0
Name: Route, Length: 2873, dtype: float64

In [None]:
# Comparing rows between copy and all
merge_list,check,df_merge_all_copy,df_left_merge_all_copy=comparision_rows(df_copy_shruti,df_all_shruti)
print(merge_list)
print("Are there duplicates at the merge?",check,"\nthere are",len(df_all_shruti)-len(df_merge_all_copy),"rows are missing")
print("There are",df_left_merge_all_copy['Toter Size'].isnull().sum(),"unmatched rows in the left table")


['Meandor', 'I or C?', 'Address #', 'Time', 'Day', 'Vehicle Type', 'Route', 'Street', 'Tipper', 'Commodity', 'Apt.#']
Are there duplicates at the merge? True 
there are 2873 rows are missing
There are 4679 unmatched rows in the left table


In [None]:
df_merge_all_copy

Unnamed: 0,Date,Day,Unnamed: 2,Route,Truck #,Vehicle Type,Commodity,Tipper,Sequence #,Address #,Apt.#,Street,Even/Odd,Meandor,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,Cardboard Box,Trash Bags,Neighborhood,Hill or Flat?,Street Sweeping,Locked,Common Notes,Additional Notes,GlobalID,x,y,Data Collector,Truck #.1,Vehicle Type.1,Steep/Flat,Street Sweeping.1,Time(Sec),Toter (unit),Toter Size,Total Volume,Neighborhood.1,Inside/Curb,Meandor.1,Key Code?,Type


In [None]:
df_left_merge_all_copy[['Day','Route','Time','Address #','Truck #','16 gal','20 gal','32 gal','64 gal','96 gal','Toter Size']]

Unnamed: 0,Day,Route,Time,Address #,Truck #,16 gal,20 gal,32 gal,64 gal,96 gal,Toter Size
0,2.0,912.0,118,5128/5132,14611.0,0.0,0.0,1.0,2.0,2.0,
1,2.0,912.0,59,5620,14611.0,0.0,0.0,0.0,0.0,1.0,
2,2.0,912.0,86,1947,14611.0,0.0,0.0,0.0,0.0,1.0,
3,2.0,912.0,41,1919,14611.0,0.0,0.0,1.0,0.0,1.0,
4,2.0,912.0,31,1909,14611.0,0.0,0.0,1.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...
4674,5.0,17.0,224,1200,14458.0,1.0,0.0,6.0,2.0,,
4675,5.0,17.0,147,1200,14458.0,3.0,1.0,3.0,0.0,,
4676,5.0,17.0,63,1200,14458.0,0.0,2.0,1.0,0.0,,
4677,5.0,17.0,34,3300,14458.0,1.0,0.0,1.0,0.0,,


In [None]:
# Comparing rows between copy and file
merge_list,check,df_merge_copy_file,df_left_merge_copy_file=comparision_rows(df_copy_shruti,df_file_shruti)

print(merge_list)

print("Are there duplicates at the merge?",check,"\nthere are",len(df_file_shruti)-len(df_merge_copy_file),"rows are missing")

print("There are",df_left_merge_copy_file['1 yd'].isnull().sum(),"unmatched rows in the left table")


['I or C?', 'Hill or Flat?', 'Even/Odd', 'Route', 'Tipper', '#Units', 'Number of Stops', 'Street Sweeping', 'Trash Bags', 'Apt.#', '96 gal', 'Meandor', 'Time', 'Cardboard Box', '16 gal', 'Common Notes', 'CCAN', 'Locked', 'Truck #', '32 gal', '64 gal', '20 gal', 'Commodity', 'y', 'GlobalID', 'Date', 'Address #', 'Day', 'Block Time', 'Vehicle Type', 'Additional Notes', 'Sequence #', 'x', 'Street']
Are there duplicates at the merge? False 
there are 260 rows are missing
There are 3683 unmatched rows in the left table


In [None]:
# Inner merge between copy and file 
df_merge_copy_file

Unnamed: 0,Date,Day,Unnamed: 2,Route,Truck #,Vehicle Type,Commodity,Tipper,Sequence #,Address #,Apt.#,Street,Even/Odd,Meandor,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,Cardboard Box,Trash Bags,Neighborhood,Hill or Flat?,Street Sweeping,Locked,Common Notes,Additional Notes,GlobalID,x,y,Data Collector,1 yd,1.5 yd,2 yd,3 yd,4 yd,5 yd,6 yd
0,3/10/2020,2.0,,912.0,14611.0,S-HEIL,Recycle,2.0,1.0,5128/5132,,Geary St,,,C,118,,5.0,1.0,0.0,0.0,1.0,2.0,2.0,0,0,0.0,,Flat,N,,,,fb754c8d-6df1-4056-83b7-3de841764da6,-122.474566,37.780711,,0,0,0,0,0,0,0
1,3/10/2020,2.0,,912.0,14611.0,S-HEIL,Recycle,2.0,2.0,5620,,Geary St,,,C,59,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0,1,0.0,,Flat,Y,,,,37486b44-bec4-4336-9c9d-b24c1443c358,-122.479941,37.780396,,0,0,0,0,0,0,0
2,3/10/2020,2.0,,912.0,14611.0,S-HEIL,Recycle,2.0,3.0,1947,,Clement St,,,I,86,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0,0,0.0,,Flat,N,,"narrow_walkway,enters_garage_exits_locked_door",,cbe31490-4c96-44c7-bb07-3bb52f85eb48,-122.480585,37.782037,,0,0,0,0,0,0,0
3,3/10/2020,2.0,,912.0,14611.0,S-HEIL,Recycle,2.0,4.0,1919,,Clement St,,,C,41,,2.0,1.0,0.0,0.0,1.0,0.0,1.0,0,0,0.0,,Flat,N,,,,14f82d64-be14-4cb9-b201-42db049d93bf,-122.480040,37.782229,,0,0,0,0,0,0,0
4,3/10/2020,2.0,,912.0,14611.0,S-HEIL,Recycle,2.0,5.0,1909,,Clement St,,,C,31,,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0,0,0.0,,,,,,,2bed2bbb-3b0d-4639-b719-61e7fedf748b,-122.479709,37.782167,,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,3/12/2020,4.0,,1.0,14393.0,S-HEIL,Garbage/Compost,2.0,201.0,684,,48th Ave,,,I,90,,2.0,1.0,0.0,0.0,0.0,0.0,2.0,0,0,0.0,,Hill,N,key,garage_basement,"Service through garage, steep",35b4a191-9239-4dcb-b47b-ba4b72de1faf,-122.509131,37.775498,,0,0,0,0,0,0,0
992,3/12/2020,4.0,,1.0,14393.0,S-HEIL,Garbage/Compost,2.0,202.0,680,,48th Ave,,,C,42,,2.0,1.0,1.0,0.0,1.0,0.0,0.0,0,0,0.0,,Flat,N,,,,0459b4c8-0bf7-48a6-be79-2b6730af1453,-122.509155,37.775590,,0,0,0,0,0,0,0
993,3/12/2020,4.0,,1.0,14393.0,S-HEIL,Garbage/Compost,2.0,203.0,677,,48th Ave,,,C,43,,2.0,1.0,0.0,0.0,1.0,0.0,1.0,0,0,0.0,,Flat,Y,,,In parking bay,372f2b02-1e79-48bb-b55b-0ef728648b2a,-122.509325,37.775609,,0,0,0,0,0,0,0
994,3/12/2020,4.0,,1.0,14393.0,S-HEIL,Garbage/Compost,2.0,204.0,679,,48th Ave,,,I,48,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0,0,0.0,,Flat,Y,key,,Locked door,2257c2e0-844b-42b3-af07-d961c5e4940e,-122.509534,37.775635,,0,0,0,0,0,0,0


In [None]:
df_left_merge_copy_file[df_left_merge_copy_file['1 yd'].isnull()]['Route'].unique()

array([  5.,   2., 958., 914., 919., 932., 901., 937., 925., 907., 910.,
       930., 944.,   9.,  27.,  72., 920., 498.,  54.,  95.,  46.,  79.,
        30.,  49.,  47.,  39.,  17.,   1.,  36.,  28.,  65.,  75.,  56.,
        31.,  34., 100.,  59.,   3.,  37.,  91.,  68.,  nan])

In [None]:
# Compare file and copy 
merge_list,check,df_merge_file_copy,df_left_merge_file_copy=comparision_rows(df_file_shruti,df_copy_shruti)
df_left_merge_file_copy['Route'].unique()

array([912, 918,   1,   5,   2])

In [None]:
routes_list = sorted(df_all_shruti['Route'].unique())
print(routes_list)

[1, 3, 9, 17, 27, 28, 30, 31, 34, 36, 37, 39, 46, 47, 49, 54, 56, 59, 65, 68, 72, 75, 79, 91, 95, 100, 498, 901, 907, 910, 914, 919, 920, 925, 930, 932, 937, 944, 958]


In [None]:
# Comparing rows between all and file
# comparision_rows returns four values, so merge_list must be unpacked as well
merge_list,check,df_merge_all_file,df_left_merge_all_file=comparision_rows(df_all_shruti,df_file_shruti)

print("Are there duplicates at the merge?",check,"\nthere are",len(df_file_shruti)-len(df_merge_all_file),"rows are missing")

print("There are",df_left_merge_all_file['1 yd'].isnull().sum(),"unmatched rows in the left table")

Are there duplicates at the merge? True 
there are 1256 rows are missing
There are 2873 unmatched rows in the left table


In [None]:
df_merge_all_file

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Steep/Flat,Street Sweeping,Time(Sec),Toter (unit),Toter Size,Total Volume,Commodity,Tipper,Neighborhood,Inside/Curb,Address #,Apt.#,Street,Meandor,Key Code?,Type,Date,Truck #.1,Vehicle Type.1,Sequence #,Even/Odd,Meandor.1,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,1 yd,1.5 yd,2 yd,3 yd,4 yd,5 yd,6 yd,Cardboard Box,Trash Bags,Hill or Flat?,Street Sweeping.1,Locked,Common Notes,Additional Notes,GlobalID,x,y


In [None]:
df_left_merge_all_file

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Steep/Flat,Street Sweeping,Time(Sec),Toter (unit),Toter Size,Total Volume,Commodity,Tipper,Neighborhood,Inside/Curb,Address #,Apt.#,Street,Meandor,Key Code?,Type,Date,Truck #.1,Vehicle Type.1,Sequence #,Even/Odd,Meandor.1,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,1 yd,1.5 yd,2 yd,3 yd,4 yd,5 yd,6 yd,Cardboard Box,Trash Bags,Hill or Flat?,Street Sweeping.1,Locked,Common Notes,Additional Notes,GlobalID,x,y
0,2,68,14562,,F,N,101,6,"16,(5)32",176.0,GB,2,,C,2900,,San Bruno,,,C,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,68,14562,,F,N,24,2,(2)32,64.0,GB,2,,C,2900,,San Bruno,,,C,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2,68,14562,,F,N,129,5,"16,(4)32",144.0,GB,2,,C,2900,,San Bruno,,,C,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2,68,14562,,F,N,21,1,32,32.0,GB,2,,C,2900,,San Bruno,,,C,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2,68,14562,,F,N,28,1,32,32.0,GB,2,,C,2700,,San Bruno,,,C,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2868,4,37,,,S,N,32,2,3264,96.0,GB,2,Excelsior,C,200,,South Hill,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2869,4,37,,,S,N,60,3,162032,68.0,GB,2,Excelsior,C,200,,South Hill,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2870,4,37,,,S,Y,19,1,20,20.0,GB,2,Excelsior,C,200,,South Hill,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2871,4,37,,,S,Y,20,1,16,16.0,GB,2,Excelsior,C,200,,South Hill,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
# df_file lists 6 distinct date strings, but '3/09/2020' and '3/9/2020' are the same day, so it covers 5 dates
df_file['Date'].value_counts()

3/11/2020    477
3/12/2020    205
3/09/2020    193
3/10/2020    177
3/13/2020    139
3/9/2020      65
Name: Date, dtype: int64
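
The counts above split March 9 across two spellings ('3/09/2020' and '3/9/2020'). A small sketch of normalizing the strings before counting is shown here, using the counts from the output above. `normalize_date` is a hypothetical helper, and the M/D/YYYY format is assumed from the values shown.

```python
from collections import Counter
from datetime import datetime

def normalize_date(text):
    """Parse an M/D/YYYY string (zero-padded or not) into ISO format."""
    return datetime.strptime(text.strip(), '%m/%d/%Y').date().isoformat()

# Counts copied from the value_counts() output above.
raw_counts = {'3/11/2020': 477, '3/12/2020': 205, '3/09/2020': 193,
              '3/10/2020': 177, '3/13/2020': 139, '3/9/2020': 65}

merged = Counter()
for text, n in raw_counts.items():
    merged[normalize_date(text)] += n

print(sorted(merged.items()))
```

After normalization the two March 9 spellings collapse into one entry, leaving five dates.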

### Compare a few of the rows in each dataframe where Time=118 seconds

In [None]:
df_file.loc[df_file.Time==118]

Unnamed: 0,Date,Day,Route,Truck #,Vehicle Type,Commodity,Tipper,Sequence #,Address #,Apt.#,Street,Even/Odd,Meandor,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,1 yd,1.5 yd,2 yd,3 yd,4 yd,5 yd,6 yd,Cardboard Box,Trash Bags,Hill or Flat?,Street Sweeping,Locked,Common Notes,Additional Notes,GlobalID,x,y
0,3/10/2020,2,912,14611,S-HEIL,Recycle,2,1,5128/5132,,Geary St,,,C,118,,5.0,1,0,0,1,2,2,0,0,0,0,0,0,0,0,0,0,Flat,N,,,,fb754c8d-6df1-4056-83b7-3de841764da6,-122.474566,37.780711
124,3/10/2020,2,912,14611,S-HEIL,Recycle,2,125,181/178,,23rd Ave,,,I,118,,3.0,1,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,Flat,N,other,,,770319ba-b112-4cde-8d4b-bf1fd1ca36dd,-122.483016,37.78439


In [None]:
df_copy.loc[df_copy.Time==118]

Unnamed: 0,Date,Day,Unnamed: 2,Route,Truck #,Vehicle Type,Commodity,Tipper,Sequence #,Address #,Apt.#,Street,Even/Odd,Meandor,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,Cardboard Box,Trash Bags,Neighborhood,Hill or Flat?,Street Sweeping,Locked,Common Notes,Additional Notes,GlobalID,x,y,Data Collector
0,3/10/2020,2.0,,912.0,14611.0,S-HEIL,Recycle,2.0,1.0,5128/5132,,Geary St,,,C,118,,5.0,1.0,0.0,0.0,1.0,2.0,2.0,0,0,0.0,,Flat,N,,,,fb754c8d-6df1-4056-83b7-3de841764da6,-122.474566,37.780711,
124,3/10/2020,2.0,,912.0,14611.0,S-HEIL,Recycle,2.0,125.0,181/178,,23rd Ave,,,I,118,,3.0,1.0,0.0,0.0,0.0,2.0,1.0,0,0,0.0,,Flat,N,other,,,770319ba-b112-4cde-8d4b-bf1fd1ca36dd,-122.483016,37.78439,
1322,,4.0,,914.0,,,Recycle,2.0,19.0,,,,,I,,118,,1.0,0.0,0.0,0.0,0.0,1.0,,0,0,,,,,,,,,,,Norma
2112,,2.0,,56.0,,HEIL,Garbage/Compost,2.0,,2737,,Sutter,,I,,118,,1.0,0.0,0.0,0.0,0.0,1.0,,0,0,,,,,,,,,,,Norma
2136,,5.0,,100.0,,HEIL,Garbage/Compost,2.0,,929,,Oak,,I,,118,,1.0,0.0,0.0,0.0,0.0,1.0,,0,0,,,,N,,,,,,,Norma
3512,,4.0,,59.0,14559.0,,Garbage/Compost,2.0,,614,,Polk,,I,,118,,2.0,0.0,0.0,1.0,0.0,1.0,,0,0,,,,N,,,,,,,Norma
4771,,5.0,,75.0,14415.0,HEIL,Garbage/Compost,2.0,,,,,,C,,118,,5.0,1.0,0.0,4.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
4773,,5.0,,75.0,14415.0,HEIL,Garbage/Compost,2.0,,,,,,C,,118,,0.0,,,,,,,0,0,,,Flat,N,,,,,,,Norma
4888,,5.0,,75.0,14415.0,HEIL,Garbage/Compost,2.0,,,,,,C,,118,,5.0,2.0,0.0,3.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
4912,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,140,,Wayland,,C,,118,,6.0,2.0,0.0,4.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma


In [None]:
df_all.loc[df_all['Time(Sec)']==118]

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Steep/Flat,Street Sweeping,Time(Sec),Toter (unit),Toter Size,Total Volume,Commodity,Tipper,Neighborhood,Inside/Curb,Address #,Apt.#,Street,Meandor,Key Code?,Type
43,2,68,14562,,F,N,118,6,"(2)16,(4)32",160.0,GB,2,,C,140,,Wayland,,,R
377,5,75,14415,HIEL,F,N,118,5,"16,(4)32",144.0,GB,2,,C,na,,,,,
379,5,75,14415,HIEL,F,N,118,5,"(2)16,(3)32",128.0,GB,2,,C,na,,,,,
381,5,75,14415,HIEL,F,N,118,6,"(2)16,20,(3)32",148.0,GB,2,,C,na,,,,,
1187,5,100,,HEIL,,N,118,1,96,96.0,GB,2,,I,929,,Oak Street,,,
1246,2,56,na,HIEL,na,na,118,1,96,96.0,GB,2,,I,2737,,Sutter Street,,,
1566,4,914,na,na,na,na,118,1,96,96.0,R,2,,I,na,,,,,
2928,4,59,14559,,,N,118,2,3296,128.0,GB,2,,I,614,,Polk,,,


### Which Routes are represented in each dataframe? 

In [None]:
all_routes = set(df_all['Route'].unique())
print(sorted(list(all_routes)))

[1, 3, 9, 17, 27, 28, 30, 31, 34, 36, 37, 39, 46, 47, 49, 54, 56, 59, 65, 68, 72, 75, 79, 91, 95, 100, 498, 901, 907, 910, 914, 919, 920, 925, 930, 932, 937, 944, 958]


In [None]:
copy_routes = set(df_copy['Route'].unique())
print(len(copy_routes))
print(sorted(list(copy_routes)))

[nan, 1.0, 2.0, 3.0, 5.0, 9.0, 17.0, 27.0, 28.0, 30.0, 31.0, 34.0, 36.0, 37.0, 39.0, 46.0, 47.0, 49.0, 54.0, 56.0, 59.0, 65.0, 68.0, 72.0, 75.0, 79.0, 91.0, 95.0, 100.0, 498.0, 901.0, 907.0, 910.0, 912.0, 914.0, 918.0, 919.0, 920.0, 925.0, 930.0, 932.0, 937.0, 944.0, 958.0]
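
The two route lists printed above can be compared directly with set differences. This is a minimal sketch that uses the printed values as literals rather than re-reading the files. The `clean_routes` helper is hypothetical: it drops `NaN` and casts the rest to `int` so that `68.0` and `68` are treated as the same route.

```python
import math

# Route values copied from the two outputs above (not re-read from the Excel files).
all_routes = {1, 3, 9, 17, 27, 28, 30, 31, 34, 36, 37, 39, 46, 47, 49, 54, 56,
              59, 65, 68, 72, 75, 79, 91, 95, 100, 498, 901, 907, 910, 914, 919,
              920, 925, 930, 932, 937, 944, 958}
copy_routes = {float('nan'), 1.0, 2.0, 3.0, 5.0, 9.0, 17.0, 27.0, 28.0, 30.0,
               31.0, 34.0, 36.0, 37.0, 39.0, 46.0, 47.0, 49.0, 54.0, 56.0, 59.0,
               65.0, 68.0, 72.0, 75.0, 79.0, 91.0, 95.0, 100.0, 498.0, 901.0,
               907.0, 910.0, 912.0, 914.0, 918.0, 919.0, 920.0, 925.0, 930.0,
               932.0, 937.0, 944.0, 958.0}


def clean_routes(routes):
    """Hypothetical helper: drop NaN and cast the remaining routes to int."""
    return {int(r) for r in routes
            if not (isinstance(r, float) and math.isnan(r))}


only_in_copy = clean_routes(copy_routes) - clean_routes(all_routes)
only_in_all = clean_routes(all_routes) - clean_routes(copy_routes)
print(sorted(only_in_copy))  # [2, 5, 912, 918]
print(sorted(only_in_all))   # []
```

With these values, every route in All also appears in Copy, and Copy additionally contains routes 2, 5, 912, 918 and a missing (`NaN`) route.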


In [None]:
df_all['Toter Size'].unique()

array(['16,(5)32', '(2)32', '16,(4)32', 32, '(6)32', '(3)32',
       '(2)16,(2)32,96', '(2)64 + blade', 64, '32,(2)96', '32,(3)64',
       '(2)64', '(2)32,96', '20,32,(2)64', '(3)32,(2)64 + blade',
       '(2)32,(2)64', '(2)32,64', '(2)96', '32,64', '20,(5)32',
       '20,(3)32', '(5)32,64', '20,32', '(4)64', '(3)32,(2)64',
       '(3)32,64', 96, '(2)32,64,(2)96', '(2)16,(4)32', '16,(3)32',
       '32,64,96', '16,20,(2)32', '(2)16,32', '(2)16,(4)32,64',
       '20,32,(3)64', 20, 16, '(2)32,64,96', '32,96', '(4)32',
       '(2)20,(4)32', '16,32', '64,(2)96', '64,96', '32,(2)64',
       '(2)64,96', '(3)64', '64,(3)96', '16,64', '16,(2)32',
       '(2)16,32,64', '(2)16,(2)32', '16,20', '16,20,32', '(2)20,32',
       '(2)20,32,64', '16,20,(2)64', '(2)20,(2)32', '20,(2)32,64',
       '16,(2)20,32', '32,64,(3)96', '16,64 + 3 bags', '16,20,(4)32',
       '16,32 + 4 bags', '(2)16,32 + 4 bags', '(2)16,(2)20,(6)32,64',
       '(2)16,(3)32', '16,(2)32,64', '(4)32,64', '16,(3)32 + 1 bag',
       '

In [None]:
df_all

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Steep/Flat,Street Sweeping,Time(Sec),Toter (unit),Toter Size,Total Volume,Commodity,Tipper,Neighborhood,Inside/Curb,Address #,Apt.#,Street,Meandor,Key Code?,Type
0,2,68,14562,,F,N,101,6,"16,(5)32",176.0,GB,2,,C,2900,,San Bruno,,,C
1,2,68,14562,,F,N,24,2,(2)32,64.0,GB,2,,C,2900,,San Bruno,,,C
2,2,68,14562,,F,N,129,5,"16,(4)32",144.0,GB,2,,C,2900,,San Bruno,,,C
3,2,68,14562,,F,N,21,1,32,32.0,GB,2,,C,2900,,San Bruno,,,C
4,2,68,14562,,F,N,28,1,32,32.0,GB,2,,C,2700,,San Bruno,,,C
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3724,4,37,,,S,N,32,2,3264,96.0,GB,2,Excelsior,C,200,,South Hill,,,
3725,4,37,,,S,N,60,3,162032,68.0,GB,2,Excelsior,C,200,,South Hill,,,
3726,4,37,,,S,Y,19,1,20,20.0,GB,2,Excelsior,C,200,,South Hill,,,
3727,4,37,,,S,Y,20,1,16,16.0,GB,2,Excelsior,C,200,,South Hill,,,


### Sort columns in alphabetical order for easier comparison 

In [None]:
df_all_alphabetized = df_all.reindex(sorted(df_all.columns), axis=1)
df_copy_alphabetized = df_copy.reindex(sorted(df_copy.columns), axis=1)
df_file_alphabetized = df_file.reindex(sorted(df_file.columns), axis=1)

## An example of potential mismatches in the data
- These two rows appear to describe the same stop (same day, route, address, and time), but the recorded bins differ.

In [None]:
df_all_alphabetized.loc[(df_all_alphabetized['Day']==2) & (df_all_alphabetized['Route']==68) & (df_all_alphabetized['Time']==101)]

Unnamed: 0,#Units,Address #,Apt.#,Commodity,Day,Hill or Flat?,I or C?,Locked,Meandor,Neighborhood,Route,Street,Street Sweeping,Time,Tipper,Total Volume,Toter Size,Truck #,Type,Vehicle Type
0,6,2900,,GB,2,F,C,,,,68,San Bruno,N,101,2,176.0,"16,(5)32",14562,C,


In [None]:
df_copy_alphabetized.loc[(df_copy_alphabetized['Day']==2) & (df_copy_alphabetized['Route']==68) & (df_copy_alphabetized['Time']==101)]

Unnamed: 0,#Units,16 gal,20 gal,32 gal,64 gal,96 gal,Additional Notes,Address #,Apt.#,Block Time,CCAN,Cardboard Box,Commodity,Common Notes,Data Collector,Date,Day,Even/Odd,GlobalID,Hill or Flat?,I or C?,Locked,Meandor,Neighborhood,Number of Stops,Route,Sequence #,Street,Street Sweeping,Time,Tipper,Trash Bags,Truck #,Unnamed: 2,Vehicle Type,x,y
4910,6.0,0.0,5.0,0.0,0.0,,,2900,,,0,0,Garbage/Compost,,Norma,,2.0,,,Flat,,,C,,1.0,68.0,,San Bruno,N,101,2.0,,14562.0,,,,
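
To compare the two files bin by bin, the `Toter Size` string in df_all can be expanded into per-size counts that line up with the `16 gal` to `96 gal` columns in df_copy. The sketch below assumes the notation seen in the `Toter Size` values above: comma-separated sizes, an optional `(n)` multiplier, and an optional suffix such as `+ blade`. The `parse_toter_size` helper is hypothetical. It returns `None` for fused values such as `3264`, because how to split them is a guess.

```python
import re
from collections import Counter

SIZES = (16, 20, 32, 64, 96)  # bin sizes in gallons, as in df_copy's column names


def parse_toter_size(value):
    """Hypothetical helper: '16,(5)32' -> Counter({32: 5, 16: 1}).

    Returns None when any token is not a known bin size
    (for example a fused value such as 3264).
    """
    text = str(value).split('+')[0]  # drop suffixes like ' + blade' or ' + 3 bags'
    counts = Counter()
    for token in text.split(','):
        match = re.fullmatch(r'\s*(?:\((\d+)\))?\s*(\d+)\s*', token)
        if match is None or int(match.group(2)) not in SIZES:
            return None
        counts[int(match.group(2))] += int(match.group(1) or 1)
    return counts


parsed = parse_toter_size('16,(5)32')
print({size: parsed[size] for size in SIZES})           # {16: 1, 20: 0, 32: 5, 64: 0, 96: 0}
print(sum(parsed.values()))                             # 6 units
print(sum(size * n for size, n in parsed.items()))      # 176 gallons
```

For the df_all row above, this gives 6 units and 176 gallons, which agrees with that row's `#Units` and `Total Volume`.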


## An example where df_all can help fill in gaps in df_copy 
- In df_copy, `#Units` is 1 but none of the bin-size columns records a bin; df_all shows that a 16 gal bin should be marked.

In [None]:

df_all_alphabetized.loc[(df_all['Day']==4) & (df_all['Route']==37) & (df_all['Time']==20)]

Unnamed: 0,#Units,Address #,Apt.#,Commodity,Day,Hill or Flat?,I or C?,Locked,Meandor,Neighborhood,Route,Street,Street Sweeping,Time,Tipper,Total Volume,Toter Size,Truck #,Type,Vehicle Type
3727,1,200,,GB,4,S,C,,,Excelsior,37,South Hill,Y,20,2,16.0,16,,,


In [None]:

df_copy_alphabetized.loc[(df_copy['Day']==4) & (df_copy['Route']==37) & (df_copy['Time']==20)]

Unnamed: 0,#Units,16 gal,20 gal,32 gal,64 gal,96 gal,Additional Notes,Address #,Apt.#,Block Time,CCAN,Cardboard Box,Commodity,Common Notes,Data Collector,Date,Day,Even/Odd,GlobalID,Hill or Flat?,I or C?,Locked,Meandor,Neighborhood,Number of Stops,Route,Sequence #,Street,Street Sweeping,Time,Tipper,Trash Bags,Truck #,Unnamed: 2,Vehicle Type,x,y
2187,1.0,0.0,0.0,0.0,0.0,,,200,,,0,0,Garbage/Compost,,Norma,,4.0,,,,,,C,Excelsior,1.0,37.0,,South Hill,Y,20,2.0,,,,,,
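
Rows like this one can be found systematically by comparing `#Units` with the sum of the per-size columns. This is a minimal sketch on a toy frame that borrows df_copy's column names; the values are illustrative only, and the second row imitates the pattern shown above.

```python
import pandas as pd

size_cols = ['16 gal', '20 gal', '32 gal', '64 gal', '96 gal']

# Toy frame with df_copy's column names; the values are illustrative only.
toy = pd.DataFrame({
    '#Units': [2.0, 1.0],
    '16 gal': [0.0, 0.0],
    '20 gal': [0.0, 0.0],
    '32 gal': [2.0, 0.0],
    '64 gal': [0.0, 0.0],
    '96 gal': [0.0, 0.0],
})

# Sum the size columns per row (NaN counts as 0) and flag rows
# where the total disagrees with #Units.
bin_total = toy[size_cols].sum(axis=1)
mismatched = toy[toy['#Units'] != bin_total]
print(mismatched.index.tolist())  # [1]
```

Applied to `df_copy`, the same comparison would list the rows whose bin counts need to be filled in from df_all.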


## In another random selection, the rows are complementary: df_all is missing the Meandor information that df_copy can supply

In [None]:
df_all_alphabetized.loc[(df_all['Day']==2) & (df_all['Route']==68) & (df_all['Address #']==2400)]

Unnamed: 0,#Units,Address #,Apt.#,Commodity,Day,Hill or Flat?,I or C?,Locked,Meandor,Neighborhood,Route,Street,Street Sweeping,Time,Tipper,Total Volume,Toter Size,Truck #,Type,Vehicle Type
10,1,2400,,GB,2,F,C,,,,68,San Bruno,N,27,2,64.0,64,14562,C,
11,1,2400,,GB,2,F,C,,,,68,San Bruno,N,48,2,64.0,64,14562,C,
12,2,2400,,GB,2,F,C,,,,68,San Bruno,N,43,2,64.0,(2)32,14562,C,


In [None]:
df_copy_alphabetized.loc[(df_copy['Day']==2) & (df_copy['Route']==68) & (df_copy['Address #']==2400)]

Unnamed: 0,#Units,16 gal,20 gal,32 gal,64 gal,96 gal,Additional Notes,Address #,Apt.#,Block Time,CCAN,Cardboard Box,Commodity,Common Notes,Data Collector,Date,Day,Even/Odd,GlobalID,Hill or Flat?,I or C?,Locked,Meandor,Neighborhood,Number of Stops,Route,Sequence #,Street,Street Sweeping,Time,Tipper,Trash Bags,Truck #,Unnamed: 2,Vehicle Type,x,y
2207,1.0,0.0,0.0,1.0,0.0,,,2400,,,0,0,Garbage/Compost,,Norma,,2.0,,,Flat,,,C,,0.0,68.0,,San Bruno,N,27,2.0,,14562.0,,,,
2208,1.0,0.0,0.0,1.0,0.0,,,2400,,,0,0,Garbage/Compost,,Norma,,2.0,,,Flat,,,C,,0.0,68.0,,San Bruno,N,48,2.0,,14562.0,,,,
3574,2.0,0.0,2.0,0.0,0.0,,,2400,,,0,0,Garbage/Compost,,Norma,,2.0,,,Flat,,,C,,0.0,68.0,,San Bruno,N,43,2.0,,14562.0,,,,


In [None]:
df_copy['Meandor'].value_counts()

C     3559
I      290
I       19
Y        3
Name: Meandor, dtype: int64
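
`I` is listed twice in the `Meandor` counts above, which usually means the labels differ by an invisible character such as a trailing space. That cause is an assumption here; the output does not confirm it. A minimal sketch on a toy series shows how `str.strip()` would merge such labels:

```python
import pandas as pd

# Toy series: 'I' and 'I ' differ only by a trailing space
# (an assumed explanation for the duplicate 'I' label above).
meandor = pd.Series(['C', 'I', 'I ', 'C'])

print(meandor.nunique())          # 3 distinct labels before cleaning

cleaned = meandor.str.strip()     # remove leading and trailing whitespace
print(cleaned.value_counts().to_dict())
```

If the real column behaves the same way, `df_copy['Meandor'].str.strip().value_counts()` would show a single `I` row.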

In [None]:
df_copy['I or C?'].value_counts()

C     1098
I      181
CL      13
IC      11
Name: I or C?, dtype: int64

In [None]:
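# Left-join Copy to All on Day, Route, and stop time to compare the bin counts with Toter Size
df_left_merge_all_copy = df_copy.merge(df_all, left_on=['Day', 'Route', 'Time'], right_on=['Day', 'Route', 'Time(Sec)'], how='left')
df_left_merge_all_copy[['16 gal', '20 gal', '32 gal', '64 gal', '96 gal', 'Toter Size']]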


Unnamed: 0,16 gal,20 gal,32 gal,64 gal,96 gal,Toter Size
0,0.0,0.0,1.0,2.0,2.0,
1,0.0,0.0,0.0,0.0,1.0,
2,0.0,0.0,0.0,0.0,1.0,
3,0.0,0.0,1.0,0.0,1.0,
4,0.0,0.0,1.0,0.0,0.0,
...,...,...,...,...,...,...
12304,1.0,0.0,6.0,2.0,,
12305,3.0,1.0,3.0,0.0,,
12306,0.0,2.0,1.0,0.0,,
12307,1.0,0.0,1.0,0.0,,


In [None]:
df_copy.columns

Index(['Date', 'Day', 'Unnamed: 2', 'Route', 'Truck #', 'Vehicle Type',
       'Commodity', 'Tipper', 'Sequence #', 'Address #', 'Apt.#', 'Street',
       'Even/Odd', 'Meandor', 'I or C?', 'Time', 'Block Time', '#Units',
       'Number of Stops', '16 gal', '20 gal', '32 gal', '64 gal', '96 gal',
       'CCAN', 'Cardboard Box', 'Trash Bags', 'Neighborhood', 'Hill or Flat?',
       'Street Sweeping', 'Locked', 'Common Notes', 'Additional Notes',
       'GlobalID', 'x', 'y', 'Data Collector'],
      dtype='object')

In [None]:
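# keep=False drops every row that has a duplicate, so only rows with a unique combination of these columns remain
df_copy_test = df_copy[['Day', 'Route', 'Address #', 'Apt.#', '16 gal', '20 gal', '32 gal', '64 gal', '96 gal']].drop_duplicates(keep=False)
df_all_test = df_all[['Day', 'Route', 'Address #', 'Apt.#', 'Toter Size']].drop_duplicates(keep=False)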

In [None]:
df_merged=df_copy_test.merge(df_all_test,on=['Day','Route','Address #','Apt.#'],how='left')
#df_merged[df_merged['Toter Size'].notnull()]
df_merged

Unnamed: 0,Day,Route,Address #,Apt.#,16 gal,20 gal,32 gal,64 gal,96 gal,Toter Size
0,2.0,912.0,5128/5132,,0.0,0.0,1.0,2.0,2.0,
1,2.0,912.0,5620,,0.0,0.0,0.0,0.0,1.0,
2,2.0,912.0,1947,,0.0,0.0,0.0,0.0,1.0,
3,2.0,912.0,1919,,0.0,0.0,1.0,0.0,1.0,
4,2.0,912.0,1909,,0.0,0.0,1.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...
2915,5.0,17.0,1200,,1.0,0.0,6.0,2.0,,
2916,5.0,17.0,1200,,3.0,1.0,3.0,0.0,,
2917,5.0,17.0,1200,,0.0,2.0,1.0,0.0,,
2918,5.0,17.0,3300,,1.0,0.0,1.0,0.0,,


In [None]:
df_copy.columns,df_all.columns

(Index(['Date', 'Day', 'Unnamed: 2', 'Route', 'Truck #', 'Vehicle Type',
        'Commodity', 'Tipper', 'Sequence #', 'Address #', 'Apt.#', 'Street',
        'Even/Odd', 'Meandor', 'I or C?', 'Time', 'Block Time', '#Units',
        'Number of Stops', '16 gal', '20 gal', '32 gal', '64 gal', '96 gal',
        'CCAN', 'Cardboard Box', 'Trash Bags', 'Neighborhood', 'Hill or Flat?',
        'Street Sweeping', 'Locked', 'Common Notes', 'Additional Notes',
        'GlobalID', 'x', 'y', 'Data Collector'],
       dtype='object'),
 Index(['Day', 'Route', 'Truck # ', 'Vehicle Type', 'Steep/Flat',
        'Street Sweeping ', 'Time', 'Toter (unit) ', 'Toter Size',
        'Total Volume', 'Commodity', 'Tipper', 'Neighborhood ', 'I or C?',
        'Address #', 'Apt.#', 'Street', 'Meandor', 'Key Code?', 'Type'],
       dtype='object'))

In [None]:
df_copy

In [None]:
df_all_grouped = df_all.groupby(['Day', 'Route', 'Address #']).size().reset_index(name='count').sort_values('count')
df_all_grouped[df_all_grouped['count']!=1]

Unnamed: 0,Day,Route,Address #,count
122,4,59,1107,2
55,2,68,2600,2
52,2,68,2400,3
97,4,59,101,3
50,2,68,140,4
91,4,37,700,5
53,2,68,2500,5
89,4,37,300,6
87,4,37,0,7
129,4,958,na,7


In [None]:
# Test to see if a combination is a unique key 
df_all_grouped = df_all.groupby(['Day', 'Route', 'Address #']).size().reset_index(name='count')
df_copy_grouped = df_copy.groupby(['Day', 'Route', 'Address #']).size().reset_index(name='count')
df_merged_grouped = df_all_grouped.merge(df_copy_grouped, how='inner', on=['Day', 'Route', 'Address #'])


In [None]:
# These are the Day, Route, Address combinations that are present in both Copy and All
df_merged_grouped

Unnamed: 0,Day,Route,Address #,count_x,count_y
0,1,47,1,1,1
1,1,47,8,1,1
2,1,47,23,1,1
3,1,47,105,1,1
4,1,47,119,1,1
5,1,47,125,1,1
6,1,47,150,1,1
7,1,47,161,1,1
8,1,47,166,1,1
9,1,47,168,1,1
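
The grouped counts above test whether `Day`, `Route`, and `Address #` together form a unique key. The same check can be written with `DataFrame.duplicated`. This is a minimal sketch on toy rows shaped like df_all; the rows are illustrative, with address 2400 repeated three times as in the grouped counts shown earlier.

```python
import pandas as pd

key = ['Day', 'Route', 'Address #']

# Toy rows shaped like df_all; address 2400 occurs three times.
toy = pd.DataFrame({
    'Day': [2, 2, 2, 2],
    'Route': [68, 68, 68, 68],
    'Address #': [2400, 2400, 2400, 2700],
    'Time': [27, 48, 43, 28],
})

# keep=False marks every row whose key occurs more than once.
is_dup = toy.duplicated(subset=key, keep=False)
key_is_unique = not is_dup.any()
print(key_is_unique)                              # False
print(toy.loc[is_dup, key].drop_duplicates())     # the offending key combinations

# Adding Time to the key makes every toy row distinct.
print(not toy.duplicated(subset=key + ['Time']).any())  # True
```

On the toy rows, adding `Time` to the key removes the collisions; whether that holds for the full files would need to be checked the same way.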


In [None]:
df_new_copy=df_merged_grouped.merge(df_copy,on=['Day','Route','Address #'],how='left')

df_new_copy=df_new_copy[['Day','Route','Address #','16 gal', '20 gal', '32 gal', '64 gal', '96 gal']]
df_new_copy=df_new_copy.merge(df_all,on=['Day','Route','Address #'],how='left')
df_new_copy[['Day','Route','Address #','16 gal', '20 gal', '32 gal', '64 gal', '96 gal','Toter Size']]

Unnamed: 0,Day,Route,Address #,16 gal,20 gal,32 gal,64 gal,96 gal,Toter Size
0,1,47,1,0.0,1.0,0.0,1.0,,3296
1,1,47,8,0.0,0.0,1.0,3.0,,"64,(3)96"
2,1,47,23,0.0,1.0,0.0,1.0,,3296
3,1,47,105,0.0,0.0,2.0,1.0,,"(2)64,96"
4,1,47,119,0.0,0.0,0.0,1.0,,96
...,...,...,...,...,...,...,...,...,...
1190,5,100,888,0.0,0.0,0.0,1.0,,96
1191,5,100,900,0.0,0.0,3.0,2.0,,"(3)64,(2)96"
1192,5,100,907,0.0,1.0,0.0,0.0,,32
1193,5,100,929,0.0,0.0,0.0,1.0,,96


In [None]:
df_all.columns

Index(['Day', 'Route', 'Truck #', 'Vehicle Type', 'Hill or Flat?',
       'Street Sweeping', 'Time', '#Units', 'Toter Size', 'Total Volume',
       'Commodity', 'Tipper', 'Neighborhood', 'I or C?', 'Address #', 'Apt.#',
       'Street', 'Meandor', 'Locked', 'Type'],
      dtype='object')

# Examine the duplicated rows that Shruti found 

In [None]:
df_all.loc[(df_all['Day']==4) & (df_all['Street']=='Prague') & (df_all['Address #']==700)]

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Hill or Flat?,Street Sweeping,Time,#Units,Toter Size,Total Volume,Commodity,Tipper,Neighborhood,I or C?,Address #,Apt.#,Street,Meandor,Locked,Type
1174,4,37,,,F,N,33,2,2032,52.0,GB,2,Excelsior,C,700,,Prague,,,
1175,4,37,,,F,N,36,2,(2)20,40.0,GB,2,Excelsior,C,700,,Prague,,,
1176,4,37,,,F,N,39,2,2032,52.0,GB,2,Excelsior,C,700,,Prague,,,
1177,4,37,,,F,N,41,2,(2)64,128.0,GB,2,Excelsior,C,700,,Prague,,,


In [None]:
df_all.loc[(df_all['Day']==4) & (df_all['Street']=='South Hill ') & (df_all['Address #']==0)]

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Hill or Flat?,Street Sweeping,Time,#Units,Toter Size,Total Volume,Commodity,Tipper,Neighborhood,I or C?,Address #,Apt.#,Street,Meandor,Locked,Type
1178,4,37,,,F,N,86,4,"16,(3)32",112.0,GB,2,Excelsior,C,0,,South Hill,,,
1179,4,37,,,F,N,66,3,"(2)32,64",128.0,GB,2,Excelsior,C,0,,South Hill,,,
1180,4,37,,,F,N,33,2,2032,52.0,GB,2,Excelsior,C,0,,South Hill,,,
1181,4,37,,,F,N,62,4,"16,20,(2)32",100.0,GB,2,Excelsior,C,0,,South Hill,,,
1182,4,37,,,F,N,102,5,"16,20,(2)32 + 1 bag",,GB,2,Excelsior,C,0,,South Hill,,,
1183,4,37,,,F,N,149,5,"16,32,(3)64",240.0,GB,2,Excelsior,C,0,,South Hill,,,
1184,4,37,,,F,N,39,2,2032,52.0,GB,2,Excelsior,C,0,,South Hill,,,


In [None]:
df_all.loc[(df_all['Day']==2) & (df_all['Street']=='Wayland')]

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Hill or Flat?,Street Sweeping,Time,#Units,Toter Size,Total Volume,Commodity,Tipper,Neighborhood,I or C?,Address #,Apt.#,Street,Meandor,Locked,Type
43,2,68,14562,,F,N,118,6,"(2)16,(4)32",160.0,GB,2,,C,140,,Wayland,,,R
44,2,68,14562,,F,N,79,3,(3)32,96.0,GB,2,,C,140,,Wayland,,,R
45,2,68,14562,,F,N,46,2,(2)32,64.0,GB,2,,C,140,,Wayland,,,R
46,2,68,14562,,F,N,45,2,(2)32,64.0,GB,2,,C,140,,Wayland,,,R


In [None]:
df_all.loc[(df_all['Day']==2) & (df_all['Street']=='San Bruno') & (df_all['Address #']==2500)]

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Hill or Flat?,Street Sweeping,Time,#Units,Toter Size,Total Volume,Commodity,Tipper,Neighborhood,I or C?,Address #,Apt.#,Street,Meandor,Locked,Type
8,2,68,14562,,F,N,106,5,"(2)16,(2)32,96",192.0,GB,2,,C,2500,,San Bruno,,,C
9,2,68,14562,,F,N,119,2,(2)64 + blade,,GB,2,,C,2500,,San Bruno,,,C
14,2,68,14562,,F,N,208,4,"32,(3)64",224.0,GB,2,,C,2500,,San Bruno,,,C
15,2,68,14562,,F,N,61,2,(2)32,64.0,GB,2,,C,2500,,San Bruno,,,C
42,2,68,14562,,F,,156,5,"(2)32,64,(2)96",320.0,GB,2,,C,2500,,San Bruno,,,R


In [None]:
# Is Apt.# missing in Copy too? Yes: it is missing in both All and Copy.
df_copy.loc[(df_copy['Day']==2) & (df_copy['Street']=='San Bruno') & (df_copy['Address #']==2500)]

Unnamed: 0,Date,Day,Unnamed: 2,Route,Truck #,Vehicle Type,Commodity,Tipper,Sequence #,Address #,Apt.#,Street,Even/Odd,Meandor,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,Cardboard Box,Trash Bags,Neighborhood,Hill or Flat?,Street Sweeping,Locked,Common Notes,Additional Notes,GlobalID,x,y,Data Collector
3573,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2500,,San Bruno,,C,,119,,2.0,0.0,0.0,0.0,2.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
3575,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2500,,San Bruno,,C,,61,,2.0,0.0,0.0,2.0,0.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
4632,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2500,,San Bruno,,C,,208,,4.0,0.0,0.0,1.0,3.0,0.0,,0,0,,,Flat,N,,,,,,,Norma
4801,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2500,,San Bruno,,C,,156,,5.0,0.0,0.0,2.0,1.0,2.0,,0,0,,,Flat,,,,,,,,Norma
4803,,2.0,,68.0,14562.0,,Garbage/Compost,2.0,,2500,,San Bruno,,C,,106,,0.0,,,,,,,0,0,,,Flat,N,,,,,,,Norma


In [None]:
df_all.loc[(df_all['Day']==2) & (df_all['Street']=='Woosley') & (df_all['Address #']==2800)]

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Hill or Flat?,Street Sweeping,Time,#Units,Toter Size,Total Volume,Commodity,Tipper,Neighborhood,I or C?,Address #,Apt.#,Street,Meandor,Locked,Type
27,2,68,14562,,F,N,52,2,(2)96,192.0,GB,2,,C,2800,,Woosley,,,C
28,2,68,14562,,F,N,57,2,(2)64,128.0,GB,2,,C,2800,,Woosley,,,C
29,2,68,14562,,F,N,50,2,3264,96.0,GB,2,,C,2800,,Woosley,,,C
30,2,68,14562,,F,N,128,6,"20,(5)32",180.0,GB,2,,C,2800,,Woosley,,,C
31,2,68,14562,,F,N,85,4,"20,(3)32",116.0,GB,2,,C,2800,,Woosley,,,C
32,2,68,14562,,F,N,127,6,"(5)32,64",224.0,GB,2,,C,2800,,Woosley,,,C
33,2,68,14562,,F,N,47,2,2032,52.0,GB,2,,C,2800,,Woosley,,,C
34,2,68,14562,,F,N,141,6,"16,(5)32",176.0,GB,2,,C,2800,,Woosley,,,C


In [None]:
df_all.loc[(df_all['Day']==2) & (df_all['Street']=='San Bruno') & (df_all['Address #']==2900)]

Unnamed: 0,Day,Route,Truck #,Vehicle Type,Hill or Flat?,Street Sweeping,Time,#Units,Toter Size,Total Volume,Commodity,Tipper,Neighborhood,I or C?,Address #,Apt.#,Street,Meandor,Locked,Type
0,2,68,14562,,F,N,101,6,"16,(5)32",176.0,GB,2,,C,2900,,San Bruno,,,C
1,2,68,14562,,F,N,24,2,(2)32,64.0,GB,2,,C,2900,,San Bruno,,,C
2,2,68,14562,,F,N,129,5,"16,(4)32",144.0,GB,2,,C,2900,,San Bruno,,,C
3,2,68,14562,,F,N,21,1,32,32.0,GB,2,,C,2900,,San Bruno,,,C
36,2,68,14562,,F,,92,5,"(3)32,(2)64",224.0,GB,2,,C,2900,,San Bruno,,,R
37,2,68,14562,,F,,152,4,"(3)32,64",160.0,GB,2,,C,2900,,San Bruno,,,R
55,2,68,14562,,F,N,64,3,"(2)16,32",64.0,GB,2,,C,2900,,San Bruno,,,R
57,2,68,14562,,F,N,38,2,3264,96.0,GB,2,,C,2900,,San Bruno,,,R


In [None]:
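# Comparing against the string 'nan' matches no rows, because missing values are stored as NaN rather than text; .isna() is the test that counts them
df_copy.loc[df_copy['96 gal']=='nan']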

Unnamed: 0,Date,Day,Unnamed: 2,Route,Truck #,Vehicle Type,Commodity,Tipper,Sequence #,Address #,Apt.#,Street,Even/Odd,Meandor,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,Cardboard Box,Trash Bags,Neighborhood,Hill or Flat?,Street Sweeping,Locked,Common Notes,Additional Notes,GlobalID,x,y,Data Collector


In [None]:
df_copy['96 gal'].value_counts()

0.0    1043
1.0     155
2.0      48
3.0      11
Name: 96 gal, dtype: int64

In [None]:
df_copy['96 gal'].isna().sum()

3914
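
The empty result from the `== 'nan'` filter and the large `isna()` count are consistent with each other: a missing float value never equals the string `'nan'`. A minimal sketch on a toy column (illustrative values only) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; the values are illustrative only.
s = pd.Series([0.0, 1.0, np.nan])

print(int((s == 'nan').sum()))  # 0: a float NaN never equals the string 'nan'
print(int(s.isna().sum()))      # 1: isna() is the reliable test for missing values
```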

In [None]:
# Assume CL is CI 
df_file.loc[df_file['I or C?']=='CL']

Unnamed: 0,Date,Day,Route,Truck #,Vehicle Type,Commodity,Tipper,Sequence #,Address #,Apt.#,Street,Even/Odd,Meandor,I or C?,Time,Block Time,#Units,Number of Stops,16 gal,20 gal,32 gal,64 gal,96 gal,CCAN,1 yd,1.5 yd,2 yd,3 yd,4 yd,5 yd,6 yd,Cardboard Box,Trash Bags,Hill or Flat?,Street Sweeping,Locked,Common Notes,Additional Notes,GlobalID,x,y
267,3/13/2020,5,912,14611,S-HEIL,Recycle,2,92,5600,,Geary St,,,CL,105,,3.0,1,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,Flat,N,,,,4a5c04cf-077d-4062-bb47-9b9db350fcde,-122.480026,37.780399
653,3/09/2020,1,1,14393,S-HEIL,Garbage/Compost,2,56,372,,Point Lobos,,,CL,38,,2.0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,Flat,N,,,Lock on containers,b456f512-3673-40b2-9938-d490e1a96ed2,-122.507417,37.78012
790,3/09/2020,1,1,14393,S-HEIL,Garbage/Compost,2,193,970,,47th Ave,,,CL,133,,3.0,1,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,Flat,Y,key,,"Lock on 1/3 cans, all used to be locked but no...",e7bdc98b-c8ee-4e2d-b8d8-600931286226,-122.507981,37.769587
1177,3/11/2020,3,5,14391,S-HEIL,Garbage/Compost,2,185,6101,,Geary St,,,CL,28,,1.0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Flat,Y,key,,,e9e6fe0b-7d3c-4944-8134-7983da351175,-122.485,37.779846
1178,3/11/2020,3,5,14391,S-HEIL,Garbage/Compost,2,186,5901,,Geary St,,,CL,25,,1.0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Flat,Y,key,,,1d68ba90-5323-479b-8b37-e2a0574bc8d0,-122.482679,37.780175
1184,3/11/2020,3,5,14391,S-HEIL,Garbage/Compost,2,192,501,,25th Ave,,,CL,24,,1.0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,Flat,Y,key,,,cd7bb0c6-a04d-4cde-ab51-fcb912acdc41,-122.484667,37.779899
1197,3/9/2020,1,2,14608,S-HEIL,Garbage/Compost,2,8,3911,,Balboa St,,,CL,49,,1.0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Flat,Y,key,,,042e263c-e9ce-4ad1-acd2-515422aaa954,-122.500637,37.775669
1198,3/9/2020,1,2,14608,S-HEIL,Garbage/Compost,2,9,3701,,Balboa St,,,CL,33,,1.0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Flat,N,key,,,ff1e66b7-c2af-4cba-a2df-ffcbe0d9b7b1,-122.498495,37.775687
1202,3/9/2020,1,2,14608,S-HEIL,Garbage/Compost,2,13,3601,,Balboa St,,,CL,21,,1.0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Flat,N,key,,,92ee78e3-b4b5-480b-bb70-547cee87c093,-122.497441,37.775663
1209,3/9/2020,1,2,14608,S-HEIL,Garbage/Compost,2,20,3401,,Balboa St,,,CL,25,,1.0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Flat,N,key,,,f3c46685-b1cd-4f91-a9cd-d71daa014c23,-122.495291,37.775797


In [None]:
df_file.columns

Index(['Date', 'Day', 'Route', 'Truck #', 'Vehicle Type', 'Commodity',
       'Tipper', 'Sequence #', 'Address #', 'Apt.#', 'Street', 'Even/Odd',
       'Meandor', 'I or C?', 'Time', 'Block Time', '#Units', 'Number of Stops',
       '16 gal', '20 gal', '32 gal', '64 gal', '96 gal', 'CCAN', '1 yd',
       '1.5 yd', '2 yd', '3 yd', '4 yd ', '5 yd ', '6 yd ', 'Cardboard Box',
       'Trash Bags', 'Hill or Flat?', 'Street Sweeping', 'Locked',
       'Common Notes', 'Additional Notes', 'GlobalID', 'x', 'y'],
      dtype='object')

In [None]:
df_copy['Data Collector'].value_counts()

Norma    3915
Name: Data Collector, dtype: int64
