You have to work on the Dogs adoptions dataset.

It contains three files:

- dogs.csv, referred to as dogs
- dogTravel.csv, referred to as travels
- NST-EST2021-POP.csv

Notes
1. It is mandatory to use GitHub for developing the project.
2. The project must be a jupyter notebook.
3. There is no restriction on the libraries that can be used, nor on the Python version.
4. All questions on the project must be asked in a public channel on Zulip.
5. At most 3 students can be in each group. You must create the groups by yourself.
6. You do not have to send me the project before the discussion.

# Import libraries and datasets

To carry out the project I use the pandas and NumPy libraries.

In [1]:
import pandas as pd
import numpy as np

To import the datasets I use the read_csv function

In [2]:
# Raw string so that the backslashes in the Windows path are not read as escape sequences
dogs = pd.read_csv(r"D:\DATASCIENCE\PRIMO ANNO\FOUNDATION OF COMPUTER SCIENCE\dogs.csv", header=0)
dogs.head()

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/...,Dog,Dog,American Staffordshire Terrier,Mixed Breed,True,False,White / Cream,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter ...,70,124.81
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/...,Dog,Dog,Pit Bull Terrier,Mixed Breed,True,False,Brown / Chocolate,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really...,49,122.07
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/n...,Dog,Dog,Shepherd,,False,False,Brindle,...,Mesquite,NV,89027,US,89009,2019-09-20,Dog,Approx 2 years old.\n Did I catch your eye? I ...,87,281.51
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/n...,Dog,Dog,German Shepherd Dog,,False,False,,...,Pahrump,NV,89048,US,89009,2019-09-20,Dog,,62,145.83
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv...,Dog,Dog,Dachshund,,False,False,,...,Henderson,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets alon...,93,241.09


In [3]:
# Raw string so that the backslashes in the Windows path are not read as escape sequences
dogtravel = pd.read_csv(r"D:\DATASCIENCE\PRIMO ANNO\FOUNDATION OF COMPUTER SCIENCE\dogTravel.csv", header=0)
dogtravel.head()

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there
0,0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made hi...,Arkansas,,,
1,1,44698509,Groveland,FL,Duke is an almost 2 year old Potcake from Abac...,Abacos,Bahamas,,
2,2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star lookin...,Adam,Maryland,,
3,3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from an...,Adaptil,,True,
4,4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her peo...,Afghanistan,,,


In [4]:
countries = pd.read_csv("D:/DATASCIENCE/PRIMO ANNO/FOUNDATION OF COMPUTER SCIENCE/country.csv", names = ["country", "pop"]) 
countries.head()

Unnamed: 0,country,pop
0,Alabama,5.024.279
1,Alaska,733.391
2,Arizona,7.151.502
3,Arkansas,3.011.524
4,California,39.538.223


# Check the datasets

I check that the column types are as expected, so that any anomalies introduced at import can be corrected.

In [5]:
dogs.dtypes

id                   int64
org_id              object
url                 object
type.x              object
species             object
breed_primary       object
breed_secondary     object
breed_mixed           bool
breed_unknown         bool
color_primary       object
color_secondary     object
color_tertiary      object
age                 object
sex                 object
size                object
coat                object
fixed                 bool
house_trained         bool
declawed           float64
special_needs         bool
shots_current         bool
env_children        object
env_dogs            object
env_cats            object
name                object
status              object
posted              object
contact_city        object
contact_state       object
contact_zip         object
contact_country     object
stateQ              object
accessed            object
type.y              object
description         object
stay_duration        int64
stay_cost          float64
dtype: object

In [6]:
dogtravel.dtypes

index             int64
id                int64
contact_city     object
contact_state    object
description      object
found            object
manual           object
remove           object
still_there      object
dtype: object

In [7]:
countries.dtypes

country    object
pop        object
dtype: object

# Correction of data types

Since arithmetic will be needed on the 'pop' (population) column of the countries dataset, I convert it to float (decimal number).

In [8]:
# regex=False treats "." as a literal dot rather than as the regex wildcard
countries['pop'] = countries['pop'].str.replace(".", "", regex=False)
countries['pop'] = countries['pop'].astype(float)
countries


Unnamed: 0,country,pop
0,Alabama,5024279.0
1,Alaska,733391.0
2,Arizona,7151502.0
3,Arkansas,3011524.0
4,California,39538223.0
5,Colorado,5773714.0
6,Connecticut,3605944.0
7,Delaware,989948.0
8,District of Columbia,689545.0
9,Florida,21538187.0
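
The same conversion can also be applied while the file is being read. A minimal sketch, assuming the population column uses "." only as a thousands separator; the inline CSV text stands in for the real file, and `parse_population` is a hypothetical helper, not part of the notebook:

```python
import io
import pandas as pd

# Stand-in for the population file: two rows in the same "5.024.279" style
sample_csv = io.StringIO("Alabama,5.024.279\nAlaska,733.391\n")

def parse_population(text):
    """Drop the '.' thousands separators and return an integer."""
    return int(text.replace(".", ""))

# converters applies the helper to every value of the named column
sample_countries = pd.read_csv(
    sample_csv,
    names=["country", "pop"],
    converters={"pop": parse_population},
)
print(sample_countries)
```

Keeping the values as integers avoids the trailing `.0` in the output, while still allowing division later on.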


# 1. Extract all dogs with status that is not adoptable

In [9]:
not_adoptable_dogs = dogs[dogs["status"] != "Adoptable"]
not_adoptable_dogs

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/...,Dog,Dog,American Staffordshire Terrier,Mixed Breed,True,False,White / Cream,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter ...,70,124.81
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/...,Dog,Dog,Pit Bull Terrier,Mixed Breed,True,False,Brown / Chocolate,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really...,49,122.07
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/n...,Dog,Dog,Shepherd,,False,False,Brindle,...,Mesquite,NV,89027,US,89009,2019-09-20,Dog,Approx 2 years old.\n Did I catch your eye? I ...,87,281.51
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/n...,Dog,Dog,German Shepherd Dog,,False,False,,...,Pahrump,NV,89048,US,89009,2019-09-20,Dog,,62,145.83
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv...,Dog,Dog,Dachshund,,False,False,,...,Henderson,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets alon...,93,241.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58175,44605893,WY20,https://www.petfinder.com/dog/tren-44605893/wy...,Dog,Dog,Border Collie,,False,False,"Tricolor (Brown, Black, & White)",...,Lander,WY,82520,US,WY,2019-09-20,Dog,"Due to the small size of our volunteer base, w...",100,324.34
58176,44457061,WY24,https://www.petfinder.com/dog/harley-44457061/...,Dog,Dog,Australian Shepherd,Australian Cattle Dog / Blue Heeler,True,False,,...,Riverton,WY,82501,US,WY,2019-09-20,Dog,,65,245.90
58177,42865848,WY20,https://www.petfinder.com/dog/echo-42865848/wy...,Dog,Dog,Border Collie,,False,False,,...,Glenrock,WY,82637,US,WY,2019-09-20,Dog,"Due to the small size of our volunteer base, w...",100,184.06
58178,42734734,WY24,https://www.petfinder.com/dog/simon-42734734/w...,Dog,Dog,Boxer,Mixed Breed,True,False,,...,Riverton,WY,82501,US,WY,2019-09-20,Dog,,58,61.05
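
The rows displayed above include the very first rows of the dataset, so it is worth checking how the status values are actually spelled before trusting the filter. A hedged sketch, assuming the values may differ in case or surrounding whitespace (for example `adoptable`); the small frame is made-up data:

```python
import pandas as pd

# Made-up rows; the real values of "status" should be inspected with value_counts()
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "status": ["adoptable", "Adoptable ", "adopted"],
})

# Normalise case and whitespace before comparing
status_clean = sample["status"].str.strip().str.lower()
sample_not_adoptable = sample[status_clean != "adoptable"]
print(sample_not_adoptable)
```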


# 2. For each (primary) breed, determine the number of dogs

In [10]:
breed_counts = dogs.groupby("breed_primary")["id"].count().reset_index(name="count")
breed_counts

Unnamed: 0,breed_primary,count
0,Affenpinscher,17
1,Afghan Hound,4
2,Airedale Terrier,19
3,Akbash,3
4,Akita,181
...,...,...
211,Wirehaired Pointing Griffon,1
212,Wirehaired Terrier,60
213,Xoloitzcuintli / Mexican Hairless,11
214,Yellow Labrador Retriever,158


# 3. For each (primary) breed, determine the ratio between the number of dogs of Mixed Breed and those not of Mixed Breed. Hint: look at the secondary_breed

In [11]:
breed_counts = dogs.groupby(["breed_primary", "breed_mixed"])["id"].count().reset_index(name="count")

breed_counts_pivot = breed_counts.pivot(index="breed_primary", columns="breed_mixed", values="count")

breed_counts_pivot["ratio_mixed"] = breed_counts_pivot[True] / breed_counts_pivot[False]
breed_counts_pivot

breed_mixed,False,True,ratio_mixed
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Affenpinscher,12.0,5.0,0.416667
Afghan Hound,,4.0,
Airedale Terrier,2.0,17.0,8.500000
Akbash,1.0,2.0,2.000000
Akita,98.0,83.0,0.846939
...,...,...,...
Wirehaired Pointing Griffon,,1.0,
Wirehaired Terrier,15.0,45.0,3.000000
Xoloitzcuintli / Mexican Hairless,6.0,5.0,0.833333
Yellow Labrador Retriever,36.0,122.0,3.388889
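
The hint in the question points at the secondary breed rather than at the `breed_mixed` flag. A minimal sketch of that reading, assuming a dog counts as mixed when `breed_secondary` equals "Mixed Breed" (as in the first rows of the dataset); the frame below is made-up data:

```python
import pandas as pd

sample = pd.DataFrame({
    "breed_primary": ["Akita", "Akita", "Akita", "Boxer", "Boxer"],
    "breed_secondary": ["Mixed Breed", None, None, "Mixed Breed", "Mixed Breed"],
})

# True where the secondary breed marks the dog as a mix (missing values count as False)
is_mixed = sample["breed_secondary"].eq("Mixed Breed")

# One row per primary breed, with one column for False and one for True
counts = pd.crosstab(sample["breed_primary"], is_mixed)
counts = counts.reindex(columns=[False, True], fill_value=0)

# Breeds without any non-mixed dog get NaN instead of a division by zero
mixed_ratio = counts[True] / counts[False].where(counts[False] > 0)
print(mixed_ratio)
```

Using `crosstab` with `reindex` keeps both columns present even when a breed has only mixed or only non-mixed dogs.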


# 4. For each (primary) breed, determine the earliest and the latest posted timestamp

In [12]:
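# Pass the aggregations by name; passing the builtins min and max is deprecated in recent pandas
breed_posted = dogs.groupby("breed_primary")["posted"].agg(["min", "max"]).reset_index()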
breed_posted.columns = ["breed_primary", "earliest_posted", "latest_posted"]
breed_posted

Unnamed: 0,breed_primary,earliest_posted,latest_posted
0,Affenpinscher,2012-03-08T10:27:33+0000,2019-09-14T10:10:51+0000
1,Afghan Hound,2017-06-29T23:28:51+0000,2019-07-27T00:38:48+0000
2,Airedale Terrier,2014-06-13T12:59:36+0000,2019-09-19T18:40:39+0000
3,Akbash,2019-07-21T00:35:59+0000,2019-08-23T17:11:04+0000
4,Akita,2012-03-03T09:31:08+0000,2019-09-20T15:19:57+0000
...,...,...,...
211,Wirehaired Pointing Griffon,2016-06-29T20:03:55+0000,2016-06-29T20:03:55+0000
212,Wirehaired Terrier,2012-11-27T14:07:54+0000,2019-09-19T22:52:45+0000
213,Xoloitzcuintli / Mexican Hairless,2007-02-01T00:00:00+0000,2019-09-08T11:15:54+0000
214,Yellow Labrador Retriever,2010-05-31T00:00:00+0000,Nashville
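
The last row above reports `Nashville` as a latest timestamp, which indicates that `posted` is compared as text and contains at least one value that is not a timestamp. A minimal sketch, assuming such values should simply be ignored: parse the column first and aggregate the parsed values. The frame is made-up data in the same format:

```python
import pandas as pd

sample = pd.DataFrame({
    "breed_primary": ["Akita", "Akita", "Boxer", "Boxer"],
    "posted": [
        "2012-03-03T09:31:08+0000",
        "2019-09-20T15:19:57+0000",
        "2010-05-31T00:00:00+0000",
        "Nashville",  # not a timestamp; becomes NaT below
    ],
})

# Parse with an explicit format; unparseable strings are coerced to NaT
sample["posted_ts"] = pd.to_datetime(
    sample["posted"], format="%Y-%m-%dT%H:%M:%S%z", errors="coerce", utc=True
)

# min and max skip NaT, so stray text no longer wins the comparison
posted_range = sample.groupby("breed_primary")["posted_ts"].agg(["min", "max"])
print(posted_range)
```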


# 5. For each state, compute the sex imbalance, that is, the difference between the number of male and female dogs. In which state is this imbalance largest?

In [13]:
state_imbalance = dogs.groupby("contact_state")["sex"].value_counts().unstack().reset_index()
state_imbalance = state_imbalance.fillna(0)
state_imbalance["imbalance"] = state_imbalance["Male"] - state_imbalance["Female"]
state_imbalance_ordinato = state_imbalance.sort_values('imbalance', ascending=False)
state_imbalance_ordinato

sex,contact_state,Female,Male,Unknown,imbalance
58,OH,1231.0,1439.0,0.0,208.0
37,IN,850.0,1027.0,0.0,177.0
69,VA,1450.0,1608.0,0.0,158.0
42,MD,679.0,814.0,0.0,135.0
66,TN,825.0,944.0,0.0,119.0
...,...,...,...,...,...
53,NH,172.0,163.0,0.0,-9.0
29,DC,176.0,160.0,0.0,-16.0
43,ME,287.0,258.0,0.0,-29.0
47,MS,275.0,235.0,0.0,-40.0


In [14]:
largest_imbalance = state_imbalance.loc[abs(state_imbalance["imbalance"]).idxmax()]

print("The state with the largest sex imbalance is", largest_imbalance["contact_state"], 
      "with a sex imbalance of", largest_imbalance["imbalance"])

The state with the largest sex imbalance is OH with a sex imbalance of 208.0


# 6. For each pair (age, size), determine the average duration of the stay and the average cost of stay

In [15]:
age_size_stats = dogs.groupby(["age", "size"]).agg({"stay_duration": "mean", "stay_cost": "mean"}).reset_index()
age_size_stats.columns = ["age", "size", "avg_stay_duration", "avg_stay_cost"]
age_size_stats

Unnamed: 0,age,size,avg_stay_duration,avg_stay_cost
0,Adult,Extra Large,89.015414,232.591561
1,Adult,Large,89.531943,238.661141
2,Adult,Medium,89.421036,238.258977
3,Adult,Small,89.407479,238.974838
4,Baby,Extra Large,87.032967,237.180879
5,Baby,Large,89.701564,238.698827
6,Baby,Medium,89.577668,237.108131
7,Baby,Small,89.958291,239.08381
8,Senior,Extra Large,88.861111,235.232361
9,Senior,Large,88.984298,237.507364


# 7. Find the dogs involved in at least 3 travels. Also list the breed of those dogs

In [16]:
dog_travel_merge = dogs.merge(dogtravel, on="id")
 
travel_counts = dog_travel_merge.groupby("id")["index"].nunique().reset_index()

ids_3travels = travel_counts.loc[travel_counts["index"] >= 3, "id"]

dogs_3travels = dogs.loc[dogs["id"].isin(ids_3travels), ["id", "breed_primary"]].drop_duplicates()

dogs_3travels

Unnamed: 0,id,breed_primary
1159,45642530,Jindo
6835,46039420,Border Collie
8526,40036107,Pit Bull Terrier
10681,45851842,Labrador Retriever
10803,45841145,Mixed Breed
...,...,...
56850,41144335,Chihuahua
56864,40103682,Rat Terrier
56875,38664932,Pit Bull Terrier
56879,38495992,Pit Bull Terrier
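
Because each row of the travels table already carries the dog's `id`, the count can also be taken without merging first. A minimal sketch on made-up data, assuming each row of the travels table is one travel:

```python
import pandas as pd

sample_travels = pd.DataFrame({"id": [10, 10, 10, 20, 30, 30]})
sample_dogs = pd.DataFrame({
    "id": [10, 20, 30],
    "breed_primary": ["Jindo", "Boxer", "Akita"],
})

# Number of travel rows per dog id
travels_per_dog = sample_travels["id"].value_counts()

# Keep the ids that appear at least 3 times and look up their breed
frequent_ids = travels_per_dog[travels_per_dog >= 3].index
frequent_dogs = sample_dogs.loc[
    sample_dogs["id"].isin(frequent_ids), ["id", "breed_primary"]
]
print(frequent_dogs)
```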


# 8. Fix the travels table so that the correct state is computed from the manual and the found fields. If manual is not missing, then it overrides what is stored in found

In [17]:
dogtravel["found"] = dogtravel["found"].fillna("")
dogtravel["manual"] = dogtravel["manual"].fillna("")

dogtravel["state"] = dogtravel.apply(lambda x: x["manual"] if x["manual"] != "" else x["found"].split(",")[-1].strip(), axis=1)

dogtravel

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there,state
0,0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made hi...,Arkansas,,,,Arkansas
1,1,44698509,Groveland,FL,Duke is an almost 2 year old Potcake from Abac...,Abacos,Bahamas,,,Bahamas
2,2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star lookin...,Adam,Maryland,,,Maryland
3,3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from an...,Adaptil,,True,,Adaptil
4,4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her peo...,Afghanistan,,,,Afghanistan
...,...,...,...,...,...,...,...,...,...,...
6189,6189,40492179,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,,True,,WV
6190,6190,45799729,Eagle Mountain,UT,Shiny is an approximately 4-6-year-old spayed ...,Wyoming,,,,Wyoming
6191,6191,34276515,Newnan,GA,Yanni is a Male Great Pyrenees that we rescue...,Yazmin,,True,,Yazmin
6192,6192,44519341,Dayton,OH,Callie is a 14 year old Chihuahua whose owner ...,Young,Ohio,,,Ohio
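
The same override rule can be written without a row-wise `apply`. A minimal sketch on made-up rows, assuming a missing `manual` value is stored as NaN:

```python
import pandas as pd

sample = pd.DataFrame({
    "found": ["Arkansas", "Abacos", "Young", None],
    "manual": [None, "Bahamas", "Ohio", None],
})

# Last comma-separated piece of "found", trimmed, as in the cell above
found_state = sample["found"].fillna("").str.split(",").str[-1].str.strip()

# Use "manual" where it is present, otherwise fall back to "found"
sample["state"] = sample["manual"].fillna(found_state)
print(sample)
```

On a table of a few thousand rows the difference in speed is small, but the column-wise form makes the precedence of `manual` over `found` explicit.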


# 9. For each state, compute the ratio between the number of travels and the population

In [18]:

# Group the DataFrame by contact state and count the number of rows in each group
state_counts = dogtravel.groupby('contact_state').size()

# Create a new DataFrame that contains the state counts as a column
state_counts_df = pd.DataFrame({'numberoftravels': state_counts})

# Merge the new DataFrame with the original DataFrame on contact state
result = pd.merge(dogtravel, state_counts_df, how='left', on='contact_state')

# Print the results
print(result)


      index        id    contact_city contact_state  \
0         0  44520267           Anoka            MN   
1         1  44698509       Groveland            FL   
2         2  45983838       Adamstown            MD   
3         3  44475904     Saint Cloud            MN   
4         4  43877389          Pueblo            CO   
...     ...       ...             ...           ...   
6189   6189  40492179        Fairmont            WV   
6190   6190  45799729  Eagle Mountain            UT   
6191   6191  34276515          Newnan            GA   
6192   6192  44519341          Dayton            OH   
6193   6193  36659999        New York            NY   

                                            description        found  \
0     Boris is a handsome mini schnauzer who made hi...     Arkansas   
1     Duke is an almost 2 year old Potcake from Abac...       Abacos   
2     Zac Woof-ron is a heartthrob movie star lookin...         Adam   
3     ~~Came in to the shelter as a transfer from a

In [19]:
# Define a mapping between state abbreviations and names
state_mapping = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}


# Replace the state abbreviations with their names
result['contact_state'] = result['contact_state'].replace(state_mapping)



In [20]:

# Merge the resulting dataframe with the country dataset on contact_state field
merged = result.merge(countries, left_on='contact_state', right_on='country')

# Print the resulting dataframe
merged

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there,state,numberoftravels,country,pop
0,0,44520267,Anoka,Minnesota,Boris is a handsome mini schnauzer who made hi...,Arkansas,,,,Arkansas,190,Minnesota,5706494.0
1,3,44475904,Saint Cloud,Minnesota,~~Came in to the shelter as a transfer from an...,Adaptil,,True,,Adaptil,190,Minnesota,5706494.0
2,312,45419523,Minneapolis,Minnesota,You can fill out an adoption application onlin...,Alabama,,,,Alabama,190,Minnesota,5706494.0
3,405,45502015,Blaine,Minnesota,Floyd is back and looking for his forever home...,Arizona,,,,Arizona,190,Minnesota,5706494.0
4,436,45724780,Plymouth,Minnesota,You can fill out an adoption application onlin...,Arkansas,,True,,Arkansas,190,Minnesota,5706494.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6065,2176,45906452,New Iberia,Louisiana,Will you allow me to be your shadow? I just gr...,Louisiana,,,True,Louisiana,5,Louisiana,4657757.0
6066,3260,44264372,Natchitoches,Louisiana,I was found in May of 2012 across from Nicky's...,Nicky,,True,,Nicky,5,Louisiana,4657757.0
6067,3704,46008478,Baton Rouge,Louisiana,Peanut was found wandering down a busy street...,Peanut,,True,,Peanut,5,Louisiana,4657757.0
6068,4044,40700634,Baker,Louisiana,Meet YoYo! This beautiful shepherd mix girl i...,S.,Louisiana,,,Louisiana,5,Louisiana,4657757.0


In [21]:
# Calculate the ratio of number/population and add it as a new column
merged['ratio'] = merged['numberoftravels'] / merged['pop']

# Print the results
merged[['country','numberoftravels', 'pop','ratio']]

Unnamed: 0,country,numberoftravels,pop,ratio
0,Minnesota,190,5706494.0,0.000033
1,Minnesota,190,5706494.0,0.000033
2,Minnesota,190,5706494.0,0.000033
3,Minnesota,190,5706494.0,0.000033
4,Minnesota,190,5706494.0,0.000033
...,...,...,...,...
6065,Louisiana,5,4657757.0,0.000001
6066,Louisiana,5,4657757.0,0.000001
6067,Louisiana,5,4657757.0,0.000001
6068,Louisiana,5,4657757.0,0.000001
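
The table above repeats the same ratio on every travel row. A sketch of a one-row-per-state variant, on made-up travel rows with populations copied from the tables above. It assumes travels are attributed to `contact_state`, and the `DC` entry is an addition that the mapping in the notebook does not contain:

```python
import pandas as pd

sample_travels = pd.DataFrame({"contact_state": ["MN", "MN", "LA", "DC"]})
sample_pop = pd.DataFrame({
    "country": ["Minnesota", "Louisiana", "District of Columbia"],
    "pop": [5706494.0, 4657757.0, 689545.0],
})
# Shortened mapping for the sketch; "DC" is an assumption, see the note above
abbrev_to_name = {"MN": "Minnesota", "LA": "Louisiana", "DC": "District of Columbia"}

# One row per state with its number of travels
per_state = (
    sample_travels["contact_state"].map(abbrev_to_name)
    .value_counts()
    .rename_axis("country")
    .reset_index(name="numberoftravels")
)

# Attach the population and compute travels per inhabitant
per_state = per_state.merge(sample_pop, on="country", how="left")
per_state["ratio"] = per_state["numberoftravels"] / per_state["pop"]
print(per_state)
```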


In [22]:
from datetime import datetime

# 10. For each dog, compute the number of days from the posted day to the day of last access

In [27]:
# Parse 'posted' as UTC, then drop the timezone so it can be subtracted from the tz-naive 'accessed'
dogs['posted'] = pd.to_datetime(dogs['posted'], errors='coerce', utc=True).dt.tz_localize(None)
dogs['accessed'] = pd.to_datetime(dogs['accessed'], errors='coerce', format='%Y-%m-%d')


In [28]:
# Calculate time difference between "accessed" and "posted" columns
dogs["time_diff"] = dogs["accessed"] - dogs["posted"]
dogs["days_diff"] = dogs["time_diff"] / pd.Timedelta(days=1)
dogs[["id","accessed","posted","days_diff"]]


Unnamed: 0,id,accessed,posted,days_diff
0,46042150,2019-09-20,2019-09-20 16:37:59,-0.693044
1,46042002,2019-09-20,2019-09-20 16:24:57,-0.683993
2,46040898,2019-09-20,2019-09-20 14:10:11,-0.590405
3,46039877,2019-09-20,2019-09-20 10:08:22,-0.422477
4,46039306,2019-09-20,2019-09-20 06:48:30,-0.283681
...,...,...,...,...
58175,44605893,2019-09-20,2019-05-03 14:23:49,139.400127
58176,44457061,2019-09-20,2019-04-13 16:20:24,159.319167
58177,42865848,2019-09-20,2018-09-27 04:18:56,357.820185
58178,42734734,2019-09-20,2018-09-12 05:03:38,372.789144
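
The negative values above appear because `accessed` is a date at midnight while `posted` keeps its time of day. A minimal sketch, assuming the question asks for whole calendar days: drop the time part of `posted` before subtracting. The two rows reuse timestamps from the output above:

```python
import pandas as pd

sample = pd.DataFrame({
    "posted": pd.to_datetime(["2019-09-20 16:37:59", "2019-05-03 14:23:49"]),
    "accessed": pd.to_datetime(["2019-09-20", "2019-09-20"]),
})

# normalize() resets the time to midnight, so only the calendar day is compared
sample["days_diff"] = (sample["accessed"] - sample["posted"].dt.normalize()).dt.days
print(sample)
```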


# 11. Partition the dogs according to the number of weeks from the posted day to the day of last access.

In [29]:
dogs["weeks_diff"] = dogs["time_diff"] / pd.Timedelta(weeks=1)
dogs[["id","accessed","posted","weeks_diff"]]

Unnamed: 0,id,accessed,posted,weeks_diff
0,46042150,2019-09-20,2019-09-20 16:37:59,-0.099006
1,46042002,2019-09-20,2019-09-20 16:24:57,-0.097713
2,46040898,2019-09-20,2019-09-20 14:10:11,-0.084344
3,46039877,2019-09-20,2019-09-20 10:08:22,-0.060354
4,46039306,2019-09-20,2019-09-20 06:48:30,-0.040526
...,...,...,...,...
58175,44605893,2019-09-20,2019-05-03 14:23:49,19.914304
58176,44457061,2019-09-20,2019-04-13 16:20:24,22.759881
58177,42865848,2019-09-20,2018-09-27 04:18:56,51.117169
58178,42734734,2019-09-20,2018-09-12 05:03:38,53.255592


In [30]:
# Define the partition criteria
bins = [-float('inf'), 2, 4, 6, 8,10, 12, 14, 16, 18, 20, float('inf')]
labels = ['0-2', '2-4', '4-6', '6-8', '8-10', '10-12','12-14', '14-16','16-18', '18-20', '20+']

# Use pd.cut() to create a new column that partitions the dogs based on the number of weeks difference
dogs["weeks_partition"] = pd.cut(dogs["weeks_diff"], bins=bins, labels=labels)

dogs[["id","accessed","posted","weeks_diff", "weeks_partition"]]

Unnamed: 0,id,accessed,posted,weeks_diff,weeks_partition
0,46042150,2019-09-20,2019-09-20 16:37:59,-0.099006,0-2
1,46042002,2019-09-20,2019-09-20 16:24:57,-0.097713,0-2
2,46040898,2019-09-20,2019-09-20 14:10:11,-0.084344,0-2
3,46039877,2019-09-20,2019-09-20 10:08:22,-0.060354,0-2
4,46039306,2019-09-20,2019-09-20 06:48:30,-0.040526,0-2
...,...,...,...,...,...
58175,44605893,2019-09-20,2019-05-03 14:23:49,19.914304,18-20
58176,44457061,2019-09-20,2019-04-13 16:20:24,22.759881,20+
58177,42865848,2019-09-20,2018-09-27 04:18:56,51.117169,20+
58178,42734734,2019-09-20,2018-09-12 05:03:38,53.255592,20+
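
Once each dog carries a bin label, the partition itself can be materialised by grouping on that label. A minimal sketch on made-up week counts, with a shortened set of bins chosen only for illustration:

```python
import pandas as pd

sample = pd.DataFrame({"id": [1, 2, 3, 4], "weeks_diff": [0.5, 1.9, 3.0, 22.7]})

bins = [-float("inf"), 2, 4, float("inf")]
labels = ["0-2", "2-4", "4+"]
sample["weeks_partition"] = pd.cut(sample["weeks_diff"], bins=bins, labels=labels)

# Map each bin label to the list of dog ids that fall into it
partition = {
    label: group["id"].tolist()
    for label, group in sample.groupby("weeks_partition", observed=True)
}
print(partition)
```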


# 12. Find duplicates in the dogs dataset. Two records are duplicates if (1) they have the same breeds and sex, and (2) they share at least 90% of the words in the description field. Extra points if you find and implement a more refined method for determining whether two rows are duplicates.


In [31]:
import pandas as pd
from difflib import SequenceMatcher

# Lowercase the descriptions and strip punctuation so that words compare reliably
dogs['description'] = dogs['description'].str.lower().str.replace(r'[^\w\s]', '', regex=True)


In [33]:

# Function to calculate the similarity between two strings based on word comparison
def get_similarity(string1, string2):
    words1 = set(string1.split())
    words2 = set(string2.split())
    # A description without any words shares nothing; this also avoids dividing by zero
    if not words1 or not words2:
        return 0.0
    common_words = words1.intersection(words2)
    return len(common_words) / max(len(words1), len(words2))


In [34]:

# Records seen so far, bucketed by (breed, sex) so that only comparable dogs are checked
seen = {}
# Pairs of row indices judged to be duplicates
duplicates = []


In [35]:

# Iterate over each record in the dogs dataset
for index, row in dogs.iterrows():
    key = (row['breed_primary'], row['sex'])  # breed_primary is taken as the breed
    description = row['description']

    # Skip missing or empty descriptions: there are no words to compare
    if not isinstance(description, str) or not description.split():
        continue

    # Compare with earlier records that share the same breed and sex
    for prev_index, prev_description in seen.get(key, []):
        if get_similarity(prev_description, description) >= 0.9:
            duplicates.append((prev_index, index))

    # Remember the current record inside the loop, so later rows are compared against it
    seen.setdefault(key, []).append((index, description))


In [37]:
duplicates
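
For the extra-points part of the question, one possible refinement is to compare descriptions as ordered sequences of words rather than as sets, so that word order and repetition also count. A minimal sketch using `difflib.SequenceMatcher` from the standard library; `sequence_similarity` is a hypothetical helper, and the 0.9 threshold is simply carried over from the question:

```python
from difflib import SequenceMatcher

def sequence_similarity(text1, text2):
    """Similarity in [0, 1] based on matching runs of words, order included."""
    words1 = text1.lower().split()
    words2 = text2.lower().split()
    # Descriptions without words are never treated as similar
    if not words1 or not words2:
        return 0.0
    return SequenceMatcher(None, words1, words2).ratio()

# Same words in the same order versus the same words in reverse order
same = sequence_similarity("friendly dog loves long walks", "Friendly dog loves long walks")
shuffled = sequence_similarity("friendly dog loves long walks", "walks long loves dog friendly")
print(same, shuffled)
```

A set-based comparison would score both pairs as identical, whereas this variant only accepts the first pair at the 0.9 threshold.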