# More data cleaning

The Ravelry data has columns where each row value is a dictionary, or a list of dictionaries, stored as a string. To keep things simple, I'd rather break that information out into plain columns before analysis than work around the dictionary structure while analyzing.


In [81]:
import pandas as pd
from pandas import json_normalize  # top-level location since pandas 1.0; the pandas.io.json path is deprecated
import ast


In [2]:
patterns_df = pd.read_csv('../data/df_patterns_clean.csv', low_memory = False)

In [3]:
pd.set_option('display.max_columns', 60)
patterns_df.head()

Unnamed: 0,patt_id,comments_count,created_at,currency,difficulty_average,difficulty_count,downloadable,favorites_count,free,gauge,gauge_divisor,gauge_pattern,name,price,projects_count,queued_projects_count,rating_average,rating_count,row_gauge,yardage,yardage_max,sizes_available,ravelry_download,yarn_weight_description,pattern_needle_sizes,packs,yarn_weight,craft,pattern_categories,pattern_attributes,pattern_type
0,17,27,2007/01/12 00:51:53 -0500,USD,4.903936,2363.0,True,11207,True,8.0,1.0,stockinette stitch,Pomatomus,,5079,4223,4.468085,2256.0,12.0,388.0,,"S, M, L",False,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","[{'id': 412, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","{'id': 2, 'name': 'Knitting', 'permalink': 'kn...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 23, 'permalink': 'top-cuff-down'}, {'i...","{'clothing': True, 'id': 2, 'name': 'Socks', '..."
1,29,70,2007/01/12 01:27:48 -0500,,2.647959,9283.0,True,23792,True,19.0,4.0,stockinette stitch,Clapotis,,23435,8219,4.431221,8520.0,25.0,820.0,,,False,Aran (8 wpi),"[{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_stee...","[{'id': 752, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...","{'id': 2, 'name': 'Knitting', 'permalink': 'kn...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 165, 'permalink': 'dropped-stitches'},...","{'clothing': True, 'id': 10, 'name': 'Shawl/Wr..."
2,40,58,2007/02/06 01:05:08 -0500,,2.939708,4445.0,True,13003,True,38.0,4.0,zigzag pattern,Jaywalker,,12071,4112,4.266636,4373.0,,465.0,,,True,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","[{'id': 430, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","{'id': 2, 'name': 'Knitting', 'permalink': 'kn...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 9, '...","{'clothing': True, 'id': 2, 'name': 'Socks', '..."
3,54,100,2007/02/07 10:46:32 -0500,USD,3.478894,3364.0,True,10444,False,9.0,4.0,"unfelted stockinette; 12 sts=4"" felted.",Felted Clogs (AC33e),7.95,13156,2447,4.527594,3171.0,,500.0,850.0,,True,Worsted (9 wpi),"[{'id': 13, 'us': '13 ', 'metric': 9.0, 'us_st...","[{'id': 31924520, 'primary_pack_id': None, 'pr...","{'crochet_gauge': None, 'id': 12, 'knit_gauge'...","{'id': 2, 'name': 'Knitting', 'permalink': 'kn...","[{'id': 363, 'name': 'Slippers', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 10, ...","{'clothing': False, 'id': 9, 'name': 'Other', ..."
4,71,43,2007/02/08 14:53:37 -0500,,4.10219,1918.0,True,24533,True,8.0,1.0,pattern,Endpaper Mitts,,5391,7112,4.451213,1855.0,,360.0,,,False,Fingering (14 wpi),[],"[{'id': 767, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","{'id': 2, 'name': 'Knitting', 'permalink': 'kn...","[{'id': 395, 'name': 'Fingerless Gloves/Mitts'...","[{'id': 3, 'permalink': 'unisex'}, {'id': 9, '...","{'clothing': True, 'id': 8, 'name': 'Mittens/G..."


Pattern df cleaning to do:
- created_at : truncate to year/month/day, possibly convert format
- yarn_weight_description : standardize (perhaps group thread with cobweb, aran with worsted, and dk with sport; look at pattern gauge to decide which way each pair goes)
- packs : break out called-for yarn
- craft : break out craft name to new column
- pattern_type : break out clothing(T/F) and type name (Socks, Shawl/Wrap etc)
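
For the created_at item above, here is a minimal sketch of one way to truncate to year/month/day. It assumes every value starts with a `YYYY/MM/DD` date, as the sample rows do; the series name `created_at` and the two sample values are only for illustration.

```python
import pandas as pd

# Two sample timestamps copied from the created_at column shown above.
created_at = pd.Series(['2007/01/12 00:51:53 -0500', '2007/03/23 21:08:29 -0400'])

# Keep only the leading 'YYYY/MM/DD' characters, then parse them as dates.
# Slicing the string first sidesteps the mixed UTC offsets (-0500 / -0400).
created_date = pd.to_datetime(created_at.str.slice(0, 10), format='%Y/%m/%d')

print(created_date.dt.strftime('%Y-%m-%d').tolist())
```

Slicing before parsing keeps the date as it was recorded locally. Converting to UTC first could push a late-evening timestamp into the next day.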


Wait on:
- pattern_needle_sizes : see if I need this info for analysis
- yarn_weight : might not need at all
- pattern_categories : might not need since pattern_type contains top-level category
- pattern_attributes : only if I get extraordinarily ambitious; I don't think I'll have time to get this far into the weeds

Thoughts on dealing with the dictionary columns:

As an example, the called-for yarn in the pattern details df is a key/value in a larger dictionary column. I want to extract the yarn name on its own into a new column.
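
A small sketch of what "dictionary column" means here. After the round trip through CSV each value is a string that only looks like a dictionary, so it has to be parsed before a key can be pulled out. The sample value is copied from the pattern_type column; `raw` and `parsed` are illustrative names.

```python
import ast

# A value as it appears in the pattern_type column: a string, not a dict.
raw = "{'clothing': True, 'id': 2, 'name': 'Socks', 'permalink': 'socks'}"

# literal_eval only accepts Python literals, so unlike eval it cannot run arbitrary code.
parsed = ast.literal_eval(raw)

print(type(raw).__name__, type(parsed).__name__)
print(parsed['name'])
```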

In [4]:
patterns_df = patterns_df.rename(columns = {'name' : 'patt_name'})

In [5]:
patterns_df.pattern_type.values

array(["{'clothing': True, 'id': 2, 'name': 'Socks', 'permalink': 'socks'}",
       "{'clothing': True, 'id': 10, 'name': 'Shawl/Wrap', 'permalink': 'shawl'}",
       "{'clothing': True, 'id': 2, 'name': 'Socks', 'permalink': 'socks'}",
       ...,
       "{'clothing': True, 'id': 2, 'name': 'Socks', 'permalink': 'socks'}",
       "{'clothing': True, 'id': 10, 'name': 'Shawl/Wrap', 'permalink': 'shawl'}",
       "{'clothing': True, 'id': 10, 'name': 'Shawl/Wrap', 'permalink': 'shawl'}"],
      dtype=object)

In [6]:
patterns_df.craft.value_counts()

{'id': 2, 'name': 'Knitting', 'permalink': 'knitting'}                    24635
{'id': 1, 'name': 'Crochet', 'permalink': 'crochet'}                       5542
{'id': 7, 'name': 'Loom Knitting', 'permalink': 'loom-knitting'}             13
{'id': 6, 'name': 'Machine Knitting', 'permalink': 'machine-knitting'}        8
Name: craft, dtype: int64

# Split faux-dictionary columns

In [7]:
# split craft column
# ast.literal_eval parses the dict-like strings without executing arbitrary code, unlike eval

patterns_df['craft'] = patterns_df['craft'].apply(ast.literal_eval)
temp = patterns_df['craft'].apply(pd.Series)
patterns_df = pd.concat([patterns_df, temp], axis = 1).drop('craft', axis = 1)
patterns_df.head()


Unnamed: 0,patt_id,comments_count,created_at,currency,difficulty_average,difficulty_count,downloadable,favorites_count,free,gauge,gauge_divisor,gauge_pattern,patt_name,price,projects_count,queued_projects_count,rating_average,rating_count,row_gauge,yardage,yardage_max,sizes_available,ravelry_download,yarn_weight_description,pattern_needle_sizes,packs,yarn_weight,pattern_categories,pattern_attributes,pattern_type,id,name,permalink
0,17,27,2007/01/12 00:51:53 -0500,USD,4.903936,2363.0,True,11207,True,8.0,1.0,stockinette stitch,Pomatomus,,5079,4223,4.468085,2256.0,12.0,388.0,,"S, M, L",False,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","[{'id': 412, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 23, 'permalink': 'top-cuff-down'}, {'i...","{'clothing': True, 'id': 2, 'name': 'Socks', '...",2,Knitting,knitting
1,29,70,2007/01/12 01:27:48 -0500,,2.647959,9283.0,True,23792,True,19.0,4.0,stockinette stitch,Clapotis,,23435,8219,4.431221,8520.0,25.0,820.0,,,False,Aran (8 wpi),"[{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_stee...","[{'id': 752, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 165, 'permalink': 'dropped-stitches'},...","{'clothing': True, 'id': 10, 'name': 'Shawl/Wr...",2,Knitting,knitting
2,40,58,2007/02/06 01:05:08 -0500,,2.939708,4445.0,True,13003,True,38.0,4.0,zigzag pattern,Jaywalker,,12071,4112,4.266636,4373.0,,465.0,,,True,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","[{'id': 430, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 9, '...","{'clothing': True, 'id': 2, 'name': 'Socks', '...",2,Knitting,knitting
3,54,100,2007/02/07 10:46:32 -0500,USD,3.478894,3364.0,True,10444,False,9.0,4.0,"unfelted stockinette; 12 sts=4"" felted.",Felted Clogs (AC33e),7.95,13156,2447,4.527594,3171.0,,500.0,850.0,,True,Worsted (9 wpi),"[{'id': 13, 'us': '13 ', 'metric': 9.0, 'us_st...","[{'id': 31924520, 'primary_pack_id': None, 'pr...","{'crochet_gauge': None, 'id': 12, 'knit_gauge'...","[{'id': 363, 'name': 'Slippers', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 10, ...","{'clothing': False, 'id': 9, 'name': 'Other', ...",2,Knitting,knitting
4,71,43,2007/02/08 14:53:37 -0500,,4.10219,1918.0,True,24533,True,8.0,1.0,pattern,Endpaper Mitts,,5391,7112,4.451213,1855.0,,360.0,,,False,Fingering (14 wpi),[],"[{'id': 767, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 395, 'name': 'Fingerless Gloves/Mitts'...","[{'id': 3, 'permalink': 'unisex'}, {'id': 9, '...","{'clothing': True, 'id': 8, 'name': 'Mittens/G...",2,Knitting,knitting


In [8]:
patterns_df = patterns_df.drop(['id', 'permalink'], axis = 1)
patterns_df = patterns_df.rename(columns = {'name' : 'craft_name'})
patterns_df.head()

Unnamed: 0,patt_id,comments_count,created_at,currency,difficulty_average,difficulty_count,downloadable,favorites_count,free,gauge,gauge_divisor,gauge_pattern,patt_name,price,projects_count,queued_projects_count,rating_average,rating_count,row_gauge,yardage,yardage_max,sizes_available,ravelry_download,yarn_weight_description,pattern_needle_sizes,packs,yarn_weight,pattern_categories,pattern_attributes,pattern_type,craft_name
0,17,27,2007/01/12 00:51:53 -0500,USD,4.903936,2363.0,True,11207,True,8.0,1.0,stockinette stitch,Pomatomus,,5079,4223,4.468085,2256.0,12.0,388.0,,"S, M, L",False,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","[{'id': 412, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 23, 'permalink': 'top-cuff-down'}, {'i...","{'clothing': True, 'id': 2, 'name': 'Socks', '...",Knitting
1,29,70,2007/01/12 01:27:48 -0500,,2.647959,9283.0,True,23792,True,19.0,4.0,stockinette stitch,Clapotis,,23435,8219,4.431221,8520.0,25.0,820.0,,,False,Aran (8 wpi),"[{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_stee...","[{'id': 752, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 165, 'permalink': 'dropped-stitches'},...","{'clothing': True, 'id': 10, 'name': 'Shawl/Wr...",Knitting
2,40,58,2007/02/06 01:05:08 -0500,,2.939708,4445.0,True,13003,True,38.0,4.0,zigzag pattern,Jaywalker,,12071,4112,4.266636,4373.0,,465.0,,,True,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","[{'id': 430, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 9, '...","{'clothing': True, 'id': 2, 'name': 'Socks', '...",Knitting
3,54,100,2007/02/07 10:46:32 -0500,USD,3.478894,3364.0,True,10444,False,9.0,4.0,"unfelted stockinette; 12 sts=4"" felted.",Felted Clogs (AC33e),7.95,13156,2447,4.527594,3171.0,,500.0,850.0,,True,Worsted (9 wpi),"[{'id': 13, 'us': '13 ', 'metric': 9.0, 'us_st...","[{'id': 31924520, 'primary_pack_id': None, 'pr...","{'crochet_gauge': None, 'id': 12, 'knit_gauge'...","[{'id': 363, 'name': 'Slippers', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 10, ...","{'clothing': False, 'id': 9, 'name': 'Other', ...",Knitting
4,71,43,2007/02/08 14:53:37 -0500,,4.10219,1918.0,True,24533,True,8.0,1.0,pattern,Endpaper Mitts,,5391,7112,4.451213,1855.0,,360.0,,,False,Fingering (14 wpi),[],"[{'id': 767, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 395, 'name': 'Fingerless Gloves/Mitts'...","[{'id': 3, 'permalink': 'unisex'}, {'id': 9, '...","{'clothing': True, 'id': 8, 'name': 'Mittens/G...",Knitting


In [9]:
print(patterns_df.pattern_type.values)

["{'clothing': True, 'id': 2, 'name': 'Socks', 'permalink': 'socks'}"
 "{'clothing': True, 'id': 10, 'name': 'Shawl/Wrap', 'permalink': 'shawl'}"
 "{'clothing': True, 'id': 2, 'name': 'Socks', 'permalink': 'socks'}" ...
 "{'clothing': True, 'id': 2, 'name': 'Socks', 'permalink': 'socks'}"
 "{'clothing': True, 'id': 10, 'name': 'Shawl/Wrap', 'permalink': 'shawl'}"
 "{'clothing': True, 'id': 10, 'name': 'Shawl/Wrap', 'permalink': 'shawl'}"]


In [10]:
# tried the same code on the other problem columns, but got errors, even with other approaches
# new approach - the values are strings, so split on ", " and then trim the extra characters

pattern_type_df = patterns_df.pattern_type.str.split(", ", expand = True)
pattern_type_df

Unnamed: 0,0,1,2,3
0,{'clothing': True,'id': 2,'name': 'Socks','permalink': 'socks'}
1,{'clothing': True,'id': 10,'name': 'Shawl/Wrap','permalink': 'shawl'}
2,{'clothing': True,'id': 2,'name': 'Socks','permalink': 'socks'}
3,{'clothing': False,'id': 9,'name': 'Other','permalink': 'other'}
4,{'clothing': True,'id': 8,'name': 'Mittens/Gloves','permalink': 'gloves'}
...,...,...,...,...
30193,{'clothing': True,'id': 2,'name': 'Socks','permalink': 'socks'}
30194,{'clothing': True,'id': 3,'name': 'Hat','permalink': 'hat'}
30195,{'clothing': True,'id': 2,'name': 'Socks','permalink': 'socks'}
30196,{'clothing': True,'id': 10,'name': 'Shawl/Wrap','permalink': 'shawl'}


In [11]:
# rename columns I want to keep
pattern_type_df = pattern_type_df.rename(columns = {0 : 'clothing', 2 : 'type_name'})

# remove label part of clothing column (regex = False: the pattern is a literal string, not a regex)
pattern_type_df['clothing'] = pattern_type_df['clothing'].str.replace("{'clothing': ", '', regex = False)

# remove label part of type_name column
pattern_type_df['type_name'] = pattern_type_df['type_name'].str.replace("'name': '", '', regex = False)

# slice off trailing quote character in type_name column
pattern_type_df['type_name'] = pattern_type_df['type_name'].str.slice(0, -1)

# drop extra columns
pattern_type_df = pattern_type_df.drop([1, 3], axis = 1)
pattern_type_df.head()


Unnamed: 0,clothing,type_name
0,True,Socks
1,True,Shawl/Wrap
2,True,Socks
3,False,Other
4,True,Mittens/Gloves
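
An alternative sketch for the same step, not what I ran above: pull the two values out by key with `str.extract` rather than by position after a split. It assumes the values are always written as `'clothing': True/False` and `'name': '...'` with single quotes, as in the rows shown; the two-row `pattern_type` series is a stand-in for the real column.

```python
import pandas as pd

# Stand-in for patterns_df.pattern_type, using two values from the output above.
pattern_type = pd.Series([
    "{'clothing': True, 'id': 2, 'name': 'Socks', 'permalink': 'socks'}",
    "{'clothing': False, 'id': 9, 'name': 'Other', 'permalink': 'other'}",
])

# Each pattern has one capture group; expand=False returns a Series per call.
extracted = pd.DataFrame({
    'clothing': pattern_type.str.extract(r"'clothing': (True|False)", expand=False),
    'type_name': pattern_type.str.extract(r"'name': '([^']*)'", expand=False),
})

print(extracted)
```

Matching on the key name does not depend on the order of the keys or on how many pieces the split produces, and rows that do not match come back as missing values instead of leftover label text.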


In [12]:
# result as expected
pattern_type_df.clothing.value_counts()

True     23161
False     7036
Name: clothing, dtype: int64
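
One thing to keep in mind: because the column was built with string operations, the values are the strings 'True' and 'False' rather than booleans. A minimal sketch of converting them, using a made-up three-row series in place of the real column:

```python
import pandas as pd

# Stand-in for pattern_type_df['clothing'] after the string cleanup.
clothing = pd.Series(['True', 'False', 'True'])

# Map the two known strings to real booleans; anything else would become NaN.
clothing_bool = clothing.map({'True': True, 'False': False})

print(clothing_bool.tolist())
```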

In [13]:
# result as expected
pattern_type_df.type_name.value_counts()

Shawl/Wrap        5445
Socks             3139
Hat               3008
Child             2214
Toys              2021
Scarf             1918
Mittens/Gloves    1647
Baby              1609
Pullover          1555
Blanket           1482
Cardigan          1448
Home              1170
Other             1094
Dishcloth          637
Bag                468
Tank/Camisole      242
Shrug              232
Tee                217
Vest               202
Pet                164
Jacket             141
Skirt               73
Dress/Suit          60
Naughty             11
Name: type_name, dtype: int64

In [14]:
# merge back to patterns_df on index
patterns_df = patterns_df.merge(pattern_type_df, how = 'outer', left_index = True, right_index = True)
patterns_df.tail()

Unnamed: 0,patt_id,comments_count,created_at,currency,difficulty_average,difficulty_count,downloadable,favorites_count,free,gauge,gauge_divisor,gauge_pattern,patt_name,price,projects_count,queued_projects_count,rating_average,rating_count,row_gauge,yardage,yardage_max,sizes_available,ravelry_download,yarn_weight_description,pattern_needle_sizes,packs,yarn_weight,pattern_categories,pattern_attributes,pattern_type,craft_name,clothing,type_name
30193,358,30,2007/03/23 21:08:29 -0400,USD,6.628429,401.0,True,7589,True,9.0,1.0,stockinette stitch,Bayerische Socks,,861,2909,4.591029,379.0,,412.0,,,False,Fingering (14 wpi),"[{'id': 19, 'us': '0', 'metric': 2.0, 'us_stee...","[{'id': 810, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 10, ...","{'clothing': True, 'id': 2, 'name': 'Socks', '...",Knitting,True,Socks
30194,216157,95,2010/12/10 00:19:56 -0500,,4.110482,353.0,True,9666,True,18.0,4.0,stockinette,Cloche Divine,,861,1906,4.41716,338.0,24.0,220.0,,"S (M,L)",True,Worsted (9 wpi),"[{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_stee...","[{'id': 14436223, 'primary_pack_id': None, 'pr...","{'crochet_gauge': None, 'id': 12, 'knit_gauge'...","[{'id': 417, 'name': 'Cloche', 'permalink': 'c...","[{'id': 2, 'permalink': 'female'}, {'id': 10, ...","{'clothing': True, 'id': 3, 'name': 'Hat', 'pe...",Knitting,True,Hat
30195,735965,16,2017/03/20 02:03:46 -0400,USD,3.871595,257.0,True,2135,False,33.0,4.0,stockinette stitch,Dropping Madness Socks,3.0,861,349,4.675277,271.0,43.0,361.0,481.0,"Small, Medium, Large",True,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","[{'id': 63728276, 'primary_pack_id': None, 'pr...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 22, 'permalink': 'toe-up'}, {'id': 261...","{'clothing': True, 'id': 2, 'name': 'Socks', '...",Knitting,True,Socks
30196,491921,25,2014/05/17 01:58:18 -0400,GBP,2.61745,298.0,True,8648,False,22.0,4.0,garter stitch - washed and blocked (vigorously),A Hap for Harriet,4.98,860,1393,4.693103,290.0,39.0,800.0,800.0,Sizing is flexible: to fit your preferences a...,True,Lace,"[{'id': 3, 'us': '3 ', 'metric': 3.25, 'us_ste...","[{'id': 65145098, 'primary_pack_id': None, 'pr...","{'crochet_gauge': '', 'id': 7, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 62, 'permalink': 'lace'}, {'id': 204, ...","{'clothing': True, 'id': 10, 'name': 'Shawl/Wr...",Knitting,True,Shawl/Wrap
30197,576820,63,2015/05/01 07:17:46 -0400,USD,2.8,210.0,True,13028,False,24.0,4.0,stockinette stitch,Everyday Shawl,7.0,861,2083,4.636792,212.0,34.0,1650.0,,One Size (see notes for measurements),True,Fingering (14 wpi),"[{'id': 5, 'us': '5 ', 'metric': 3.75, 'us_ste...","[{'id': 47150919, 'primary_pack_id': None, 'pr...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 1, 'permalink': 'male'}, {'id': 2, 'pe...","{'clothing': True, 'id': 10, 'name': 'Shawl/Wr...",Knitting,True,Shawl/Wrap


In [15]:
# drop original column
patterns_df = patterns_df.drop(['pattern_type'], axis = 1)
patterns_df.tail()

Unnamed: 0,patt_id,comments_count,created_at,currency,difficulty_average,difficulty_count,downloadable,favorites_count,free,gauge,gauge_divisor,gauge_pattern,patt_name,price,projects_count,queued_projects_count,rating_average,rating_count,row_gauge,yardage,yardage_max,sizes_available,ravelry_download,yarn_weight_description,pattern_needle_sizes,packs,yarn_weight,pattern_categories,pattern_attributes,craft_name,clothing,type_name
30193,358,30,2007/03/23 21:08:29 -0400,USD,6.628429,401.0,True,7589,True,9.0,1.0,stockinette stitch,Bayerische Socks,,861,2909,4.591029,379.0,,412.0,,,False,Fingering (14 wpi),"[{'id': 19, 'us': '0', 'metric': 2.0, 'us_stee...","[{'id': 810, 'primary_pack_id': None, 'project...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 10, ...",Knitting,True,Socks
30194,216157,95,2010/12/10 00:19:56 -0500,,4.110482,353.0,True,9666,True,18.0,4.0,stockinette,Cloche Divine,,861,1906,4.41716,338.0,24.0,220.0,,"S (M,L)",True,Worsted (9 wpi),"[{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_stee...","[{'id': 14436223, 'primary_pack_id': None, 'pr...","{'crochet_gauge': None, 'id': 12, 'knit_gauge'...","[{'id': 417, 'name': 'Cloche', 'permalink': 'c...","[{'id': 2, 'permalink': 'female'}, {'id': 10, ...",Knitting,True,Hat
30195,735965,16,2017/03/20 02:03:46 -0400,USD,3.871595,257.0,True,2135,False,33.0,4.0,stockinette stitch,Dropping Madness Socks,3.0,861,349,4.675277,271.0,43.0,361.0,481.0,"Small, Medium, Large",True,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","[{'id': 63728276, 'primary_pack_id': None, 'pr...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 22, 'permalink': 'toe-up'}, {'id': 261...",Knitting,True,Socks
30196,491921,25,2014/05/17 01:58:18 -0400,GBP,2.61745,298.0,True,8648,False,22.0,4.0,garter stitch - washed and blocked (vigorously),A Hap for Harriet,4.98,860,1393,4.693103,290.0,39.0,800.0,800.0,Sizing is flexible: to fit your preferences a...,True,Lace,"[{'id': 3, 'us': '3 ', 'metric': 3.25, 'us_ste...","[{'id': 65145098, 'primary_pack_id': None, 'pr...","{'crochet_gauge': '', 'id': 7, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 62, 'permalink': 'lace'}, {'id': 204, ...",Knitting,True,Shawl/Wrap
30197,576820,63,2015/05/01 07:17:46 -0400,USD,2.8,210.0,True,13028,False,24.0,4.0,stockinette stitch,Everyday Shawl,7.0,861,2083,4.636792,212.0,34.0,1650.0,,One Size (see notes for measurements),True,Fingering (14 wpi),"[{'id': 5, 'us': '5 ', 'metric': 3.75, 'us_ste...","[{'id': 47150919, 'primary_pack_id': None, 'pr...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 1, 'permalink': 'male'}, {'id': 2, 'pe...",Knitting,True,Shawl/Wrap


In [16]:
# repeat process for packs column
packs_df = patterns_df.packs.str.split(", ", expand = True)
packs_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,...,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949
0,[{'id': 412,'primary_pack_id': None,'project_id': None,'skeins': None,'stash_id': None,'total_grams': None,'total_meters': None,'total_ounces': None,'total_yards': None,'yarn_id': 5512,'yarn_name': 'Shelridge Yarns Soft Touch Ultra...,'yarn_weight': {'crochet_gauge': '','id': 5,'knit_gauge': '28','max_gauge': None,'min_gauge': None,'name': 'Fingering','ply': '4','wpi': '14'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'shelridge-yarns-soft-to...,'id': 5512,'name': 'Soft Touch Ultra Solid Colors','yarn_company_name': 'Shelridge Yarns','yarn_company_id': 174},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,[{'id': 752,'primary_pack_id': None,'project_id': None,'skeins': None,'stash_id': None,'total_grams': None,'total_meters': None,'total_ounces': None,'total_yards': None,'yarn_id': 679,"'yarn_name': ""Lorna's Laces Lion & Lamb Multi""",'yarn_weight': {'crochet_gauge': '','id': 1,'knit_gauge': '18','max_gauge': None,'min_gauge': None,'name': 'Aran','ply': '10','wpi': '8'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'lornas-laces-lion--lamb...,'id': 679,'name': 'Lion & Lamb Multi',"'yarn_company_name': ""Lorna's Laces""",'yarn_company_id': 38},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,[{'id': 430,'primary_pack_id': None,'project_id': None,'skeins': None,'stash_id': None,'total_grams': None,'total_meters': None,'total_ounces': None,'total_yards': None,'yarn_id': 3349,'yarn_name': 'Zwerger Garn Opal Handpainted / ...,'yarn_weight': {'crochet_gauge': None,'id': 13,'knit_gauge': '32','max_gauge': None,'min_gauge': None,'name': 'Light Fingering','ply': '3','wpi': None},'colorway': None,'shop_name': None,'yarn': {'permalink': 'zwerger-garn-opal-handp...,'id': 3349,'name': 'Opal Handpainted / Handgefärbt','yarn_company_name': 'Zwerger Garn','yarn_company_id': 631},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,[{'id': 31924520,'primary_pack_id': None,'project_id': None,'skeins': None,'stash_id': None,'total_grams': None,'total_meters': None,'total_ounces': None,'total_yards': None,'yarn_id': None,'yarn_name': None,'yarn_weight': None,'colorway': None,'shop_name': None,'yarn': None,'quantity_description': None,'personal_name': '','dye_lot': None,'color_family_id': None,'grams_per_skein': None,'yards_per_skein': None,'meters_per_skein': None,'ounces_per_skein': None,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,[{'id': 767,'primary_pack_id': None,'project_id': None,'skeins': None,'stash_id': None,'total_grams': None,'total_meters': None,'total_ounces': None,'total_yards': None,'yarn_id': 1114,'yarn_name': 'Louet Gems Pearl','yarn_weight': {'crochet_gauge': '','id': 5,'knit_gauge': '28','max_gauge': None,'min_gauge': None,'name': 'Fingering','ply': '4','wpi': '14'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'louet-gems-pearl','id': 1114,'name': 'Gems Pearl','yarn_company_name': 'Louet','yarn_company_id': 52},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [17]:
print(packs_df[10].values)

["'yarn_name': 'Shelridge Yarns Soft Touch Ultra Solid Colors'"
 '\'yarn_name\': "Lorna\'s Laces Lion & Lamb Multi"'
 "'yarn_name': 'Zwerger Garn Opal Handpainted / Handgefärbt'" ...
 "'yarn_name': None" "'yarn_name': 'Fyberspates Gleem Lace'"
 "'yarn_name': 'Brooklyn Tweed Loft'"]


In [18]:
# rename columns I want to keep
packs_df = packs_df.rename(columns = {10 : 'patt_yarn', 16 : 'patt_yarn_weight'})

# remove label part of patt_yarn column (regex = False: the pattern is a literal string)
# note: names wrapped in double quotes and None values don't match this pattern, so they keep their label
packs_df['patt_yarn'] = packs_df['patt_yarn'].str.replace("'yarn_name': '", '', regex = False)

# remove label part of patt_yarn_weight column
packs_df['patt_yarn_weight'] = packs_df['patt_yarn_weight'].str.replace("'name': '", '', regex = False)

# slice off trailing character in patt_yarn and patt_yarn_weight columns
packs_df['patt_yarn'] = packs_df['patt_yarn'].str.slice(0, -1)
packs_df['patt_yarn_weight'] = packs_df['patt_yarn_weight'].str.slice(0, -1)

# drop extra columns (the first ten, by position)
packs_df = packs_df.drop(columns = packs_df.columns[0:10])
packs_df.head()


Unnamed: 0,patt_yarn,11,12,13,14,15,patt_yarn_weight,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949
0,Shelridge Yarns Soft Touch Ultra Solid Colors,'yarn_weight': {'crochet_gauge': '','id': 5,'knit_gauge': '28','max_gauge': None,'min_gauge': None,Fingering,'ply': '4','wpi': '14'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'shelridge-yarns-soft-to...,'id': 5512,'name': 'Soft Touch Ultra Solid Colors','yarn_company_name': 'Shelridge Yarns','yarn_company_id': 174},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,'grams_per_skein': 50,'yards_per_skein': 191.0,'meters_per_skein': 174.7,'ounces_per_skein': 1.76,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None},{'id': 47707830,'primary_pack_id': None,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,"'yarn_name': ""Lorna's Laces Lion & Lamb Multi",'yarn_weight': {'crochet_gauge': '','id': 1,'knit_gauge': '18','max_gauge': None,'min_gauge': None,Aran,'ply': '10','wpi': '8'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'lornas-laces-lion--lamb...,'id': 679,'name': 'Lion & Lamb Multi',"'yarn_company_name': ""Lorna's Laces""",'yarn_company_id': 38},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,'grams_per_skein': 100,'yards_per_skein': 205.0,'meters_per_skein': 187.5,'ounces_per_skein': 3.53,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Zwerger Garn Opal Handpainted / Handgefärbt,'yarn_weight': {'crochet_gauge': None,'id': 13,'knit_gauge': '32','max_gauge': None,'min_gauge': None,Light Fingering,'ply': '3','wpi': None},'colorway': None,'shop_name': None,'yarn': {'permalink': 'zwerger-garn-opal-handp...,'id': 3349,'name': 'Opal Handpainted / Handgefärbt','yarn_company_name': 'Zwerger Garn','yarn_company_id': 631},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,'grams_per_skein': 100,'yards_per_skein': 465.0,'meters_per_skein': 425.2,'ounces_per_skein': 3.53,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,'yarn_name': Non,'yarn_weight': None,'colorway': None,'shop_name': None,'yarn': None,'quantity_description': None,'personal_name': ','dye_lot': None,'color_family_id': None,'grams_per_skein': None,'yards_per_skein': None,'meters_per_skein': None,'ounces_per_skein': None,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,Louet Gems Pearl,'yarn_weight': {'crochet_gauge': '','id': 5,'knit_gauge': '28','max_gauge': None,'min_gauge': None,Fingering,'ply': '4','wpi': '14'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'louet-gems-pearl','id': 1114,'name': 'Gems Pearl','yarn_company_name': 'Louet','yarn_company_id': 52},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,'grams_per_skein': 50,'yards_per_skein': 185.0,'meters_per_skein': 169.2,'ounces_per_skein': 1.76,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [19]:
# drop extra columns part 2

packs_df = packs_df.drop(columns = packs_df.columns[1:6])
packs_df.head()

Unnamed: 0,patt_yarn,patt_yarn_weight,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,...,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949
0,Shelridge Yarns Soft Touch Ultra Solid Colors,Fingering,'ply': '4','wpi': '14'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'shelridge-yarns-soft-to...,'id': 5512,'name': 'Soft Touch Ultra Solid Colors','yarn_company_name': 'Shelridge Yarns','yarn_company_id': 174},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,'grams_per_skein': 50,'yards_per_skein': 191.0,'meters_per_skein': 174.7,'ounces_per_skein': 1.76,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None},{'id': 47707830,'primary_pack_id': None,'project_id': None,'skeins': None,'stash_id': None,'total_grams': None,'total_meters': None,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,"'yarn_name': ""Lorna's Laces Lion & Lamb Multi",Aran,'ply': '10','wpi': '8'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'lornas-laces-lion--lamb...,'id': 679,'name': 'Lion & Lamb Multi',"'yarn_company_name': ""Lorna's Laces""",'yarn_company_id': 38},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,'grams_per_skein': 100,'yards_per_skein': 205.0,'meters_per_skein': 187.5,'ounces_per_skein': 3.53,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Zwerger Garn Opal Handpainted / Handgefärbt,Light Fingering,'ply': '3','wpi': None},'colorway': None,'shop_name': None,'yarn': {'permalink': 'zwerger-garn-opal-handp...,'id': 3349,'name': 'Opal Handpainted / Handgefärbt','yarn_company_name': 'Zwerger Garn','yarn_company_id': 631},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,'grams_per_skein': 100,'yards_per_skein': 465.0,'meters_per_skein': 425.2,'ounces_per_skein': 3.53,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,'yarn_name': Non,'personal_name': ','dye_lot': None,'color_family_id': None,'grams_per_skein': None,'yards_per_skein': None,'meters_per_skein': None,'ounces_per_skein': None,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,Louet Gems Pearl,Fingering,'ply': '4','wpi': '14'},'colorway': None,'shop_name': None,'yarn': {'permalink': 'louet-gems-pearl','id': 1114,'name': 'Gems Pearl','yarn_company_name': 'Louet','yarn_company_id': 52},'quantity_description': None,'personal_name': None,'dye_lot': None,'color_family_id': None,'grams_per_skein': 50,'yards_per_skein': 185.0,'meters_per_skein': 169.2,'ounces_per_skein': 1.76,'prefer_metric_weight': True,'prefer_metric_length': False,'shop_id': None,'thread_size': None}],,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [20]:
# final dropping of extra columns: keep only the first two

packs_df = packs_df.iloc[:, :2]
packs_df.head()

Unnamed: 0,patt_yarn,patt_yarn_weight
0,Shelridge Yarns Soft Touch Ultra Solid Colors,Fingering
1,"'yarn_name': ""Lorna's Laces Lion & Lamb Multi",Aran
2,Zwerger Garn Opal Handpainted / Handgefärbt,Light Fingering
3,'yarn_name': Non,'personal_name': '
4,Louet Gems Pearl,Fingering
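
Rows 1 and 3 above came out garbled, which suggests splitting on ", " shifts the columns when a value contains a comma or is None. A minimal sketch of an alternative, parsing each cell with `ast.literal_eval` and reading the keys directly. The sample strings are made up, and the nesting of `'yarn_weight'` -> `'name'` is an assumption inferred from the split output, not confirmed against the raw data.

```python
import ast
import pandas as pd

# Hypothetical sample mimicking the packs column; the key layout is assumed.
sample = pd.Series([
    "[{'id': 752, 'yarn_name': \"Lorna's Laces Lion & Lamb Multi\", "
    "'yarn_weight': {'name': 'Aran', 'ply': '10', 'wpi': '8'}, 'colorway': None}]",
    "[{'id': 9, 'yarn_name': None, 'yarn_weight': None, 'colorway': None}]",
    "[]",
])

def first_pack(text):
    """Return (yarn_name, yarn_weight name) for the first pack, or (None, None)."""
    packs = ast.literal_eval(text) if isinstance(text, str) else []
    if not packs:
        return (None, None)
    pack = packs[0]
    weight = pack.get('yarn_weight') or {}  # yarn_weight may be None
    return (pack.get('yarn_name'), weight.get('name'))

packs_sketch = pd.DataFrame(
    sample.map(first_pack).tolist(),
    columns=['patt_yarn', 'patt_yarn_weight'],
    index=sample.index,
)
```

Keeping the original index means the result could still be merged back on index.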


In [21]:
# merge back to patterns_df on index
patterns_df = patterns_df.merge(packs_df, how = 'outer', left_index = True, right_index = True)

# drop original column
patterns_df = patterns_df.drop(columns = ['packs'])
patterns_df.tail()

Unnamed: 0,patt_id,comments_count,created_at,currency,difficulty_average,difficulty_count,downloadable,favorites_count,free,gauge,gauge_divisor,gauge_pattern,patt_name,price,projects_count,queued_projects_count,rating_average,rating_count,row_gauge,yardage,yardage_max,sizes_available,ravelry_download,yarn_weight_description,pattern_needle_sizes,yarn_weight,pattern_categories,pattern_attributes,craft_name,clothing,type_name,patt_yarn,patt_yarn_weight
30193,358,30,2007/03/23 21:08:29 -0400,USD,6.628429,401.0,True,7589,True,9.0,1.0,stockinette stitch,Bayerische Socks,,861,2909,4.591029,379.0,,412.0,,,False,Fingering (14 wpi),"[{'id': 19, 'us': '0', 'metric': 2.0, 'us_stee...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 10, ...",Knitting,True,Socks,Lang Yarns Jawoll Superwash Solids,Fingering
30194,216157,95,2010/12/10 00:19:56 -0500,,4.110482,353.0,True,9666,True,18.0,4.0,stockinette,Cloche Divine,,861,1906,4.41716,338.0,24.0,220.0,,"S (M,L)",True,Worsted (9 wpi),"[{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_stee...","{'crochet_gauge': None, 'id': 12, 'knit_gauge'...","[{'id': 417, 'name': 'Cloche', 'permalink': 'c...","[{'id': 2, 'permalink': 'female'}, {'id': 10, ...",Knitting,True,Hat,Knit Picks Swish Worsted,Worsted
30195,735965,16,2017/03/20 02:03:46 -0400,USD,3.871595,257.0,True,2135,False,33.0,4.0,stockinette stitch,Dropping Madness Socks,3.0,861,349,4.675277,271.0,43.0,361.0,481.0,"Small, Medium, Large",True,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 22, 'permalink': 'toe-up'}, {'id': 261...",Knitting,True,Socks,'yarn_name': Non,'personal_name': '
30196,491921,25,2014/05/17 01:58:18 -0400,GBP,2.61745,298.0,True,8648,False,22.0,4.0,garter stitch - washed and blocked (vigorously),A Hap for Harriet,4.98,860,1393,4.693103,290.0,39.0,800.0,800.0,Sizing is flexible: to fit your preferences a...,True,Lace,"[{'id': 3, 'us': '3 ', 'metric': 3.25, 'us_ste...","{'crochet_gauge': '', 'id': 7, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 62, 'permalink': 'lace'}, {'id': 204, ...",Knitting,True,Shawl/Wrap,Fyberspates Gleem Lace,Lace
30197,576820,63,2015/05/01 07:17:46 -0400,USD,2.8,210.0,True,13028,False,24.0,4.0,stockinette stitch,Everyday Shawl,7.0,861,2083,4.636792,212.0,34.0,1650.0,,One Size (see notes for measurements),True,Fingering (14 wpi),"[{'id': 5, 'us': '5 ', 'metric': 3.75, 'us_ste...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 1, 'permalink': 'male'}, {'id': 2, 'pe...",Knitting,True,Shawl/Wrap,Brooklyn Tweed Loft,Fingering


# Other cleaning on patterns_df

In [23]:
patterns_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30198 entries, 0 to 30197
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   patt_id                  30198 non-null  int64  
 1   comments_count           30198 non-null  int64  
 2   created_at               30198 non-null  object 
 3   currency                 21236 non-null  object 
 4   difficulty_average       30198 non-null  float64
 5   difficulty_count         30175 non-null  float64
 6   downloadable             30197 non-null  object 
 7   favorites_count          30198 non-null  int64  
 8   free                     30197 non-null  object 
 9   gauge                    21856 non-null  float64
 10  gauge_divisor            23684 non-null  float64
 11  gauge_pattern            19890 non-null  object 
 12  patt_name                30198 non-null  object 
 13  price                    11263 non-null  float64
 14  projects_count        

In [24]:
# truncate created_at column to the date only (first 10 characters, YYYY/MM/DD)
patterns_df['created_at'] = patterns_df['created_at'].str.slice(0, 10)
patterns_df.tail()

Unnamed: 0,patt_id,comments_count,created_at,currency,difficulty_average,difficulty_count,downloadable,favorites_count,free,gauge,gauge_divisor,gauge_pattern,patt_name,price,projects_count,queued_projects_count,rating_average,rating_count,row_gauge,yardage,yardage_max,sizes_available,ravelry_download,yarn_weight_description,pattern_needle_sizes,yarn_weight,pattern_categories,pattern_attributes,craft_name,clothing,type_name,patt_yarn,patt_yarn_weight
30193,358,30,2007/03/23,USD,6.628429,401.0,True,7589,True,9.0,1.0,stockinette stitch,Bayerische Socks,,861,2909,4.591029,379.0,,412.0,,,False,Fingering (14 wpi),"[{'id': 19, 'us': '0', 'metric': 2.0, 'us_stee...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 3, 'permalink': 'unisex'}, {'id': 10, ...",Knitting,True,Socks,Lang Yarns Jawoll Superwash Solids,Fingering
30194,216157,95,2010/12/10,,4.110482,353.0,True,9666,True,18.0,4.0,stockinette,Cloche Divine,,861,1906,4.41716,338.0,24.0,220.0,,"S (M,L)",True,Worsted (9 wpi),"[{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_stee...","{'crochet_gauge': None, 'id': 12, 'knit_gauge'...","[{'id': 417, 'name': 'Cloche', 'permalink': 'c...","[{'id': 2, 'permalink': 'female'}, {'id': 10, ...",Knitting,True,Hat,Knit Picks Swish Worsted,Worsted
30195,735965,16,2017/03/20,USD,3.871595,257.0,True,2135,False,33.0,4.0,stockinette stitch,Dropping Madness Socks,3.0,861,349,4.675277,271.0,43.0,361.0,481.0,"Small, Medium, Large",True,Fingering (14 wpi),"[{'id': 1, 'us': '1 ', 'metric': 2.25, 'us_ste...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 885, 'name': 'Mid-calf', 'permalink': ...","[{'id': 22, 'permalink': 'toe-up'}, {'id': 261...",Knitting,True,Socks,'yarn_name': Non,'personal_name': '
30196,491921,25,2014/05/17,GBP,2.61745,298.0,True,8648,False,22.0,4.0,garter stitch - washed and blocked (vigorously),A Hap for Harriet,4.98,860,1393,4.693103,290.0,39.0,800.0,800.0,Sizing is flexible: to fit your preferences a...,True,Lace,"[{'id': 3, 'us': '3 ', 'metric': 3.25, 'us_ste...","{'crochet_gauge': '', 'id': 7, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 62, 'permalink': 'lace'}, {'id': 204, ...",Knitting,True,Shawl/Wrap,Fyberspates Gleem Lace,Lace
30197,576820,63,2015/05/01,USD,2.8,210.0,True,13028,False,24.0,4.0,stockinette stitch,Everyday Shawl,7.0,861,2083,4.636792,212.0,34.0,1650.0,,One Size (see notes for measurements),True,Fingering (14 wpi),"[{'id': 5, 'us': '5 ', 'metric': 3.75, 'us_ste...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 350, 'name': 'Shawl / Wrap', 'permalin...","[{'id': 1, 'permalink': 'male'}, {'id': 2, 'pe...",Knitting,True,Shawl/Wrap,Brooklyn Tweed Loft,Fingering
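
If the dates are needed for sorting or grouping later, they could be stored as real datetimes instead of strings. A minimal sketch on two sample values copied from the column above. It keeps the local calendar date and ignores the time and UTC offset, which is an assumption about what the analysis needs.

```python
import pandas as pd

# Sample values copied from the created_at column shown above.
created = pd.Series(['2007/03/23 21:08:29 -0400', '2010/12/10 00:19:56 -0500'])

# Keep only the 10-character date part, then parse it with an explicit format.
created_date = pd.to_datetime(created.str.slice(0, 10), format='%Y/%m/%d')
```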


In [25]:
patterns_df.to_csv('../data/df_patterns_clean2.csv', index = False)

----------------

# Shop data cleaning

In [75]:
shops_df = pd.read_csv('../data/df_shop_clean.csv')

In [76]:
shops_df.head()

Unnamed: 0,address,city,id,latitude,location,longitude,name,pos_online,ravelry_retailer,shop_email,zip,country,state
0,5 Alabama Avenue (soon),LaFayette,6459,,"5 Alabama Avenue (soon), LaFayette, Alabama",,Courthouse Yanier,True,False,phootsy@courthouseyarnier.com,36862,"{'id': 229, 'name': 'United States'}","{'id': 3596, 'name': 'Alabama'}"
1,817B Regal Drive,Huntsville,9966,34.7091,"817B Regal Drive, Huntsville, Alabama",-86.5875,Fiber Art Work,True,True,fiberartwork@gmail.com,35801,"{'id': 229, 'name': 'United States'}","{'id': 3596, 'name': 'Alabama'}"
2,"25219 Hwy 195, P.O. Box 392 (for mailing)",Double Springs,11655,34.1465,"25219 Hwy 195, P.O. Box 392 (for mailing), Dou...",-87.4022,Fine Yarns on Main,False,True,fineyarnsonmain@gmail.com,35553,"{'id': 229, 'name': 'United States'}","{'id': 3596, 'name': 'Alabama'}"
3,15314 Court Street,Moulton,8023,34.4825,"15314 Court Street, Moulton, Alabama",-87.2766,Granny’s Quilt Shop,False,False,,35650,"{'id': 229, 'name': 'United States'}","{'id': 3596, 'name': 'Alabama'}"
4,105 D Church Street,Madison,12262,34.6946,"105 D Church Street, Madison, Alabama",-86.7487,Hook A Frog Fiber & Fun,True,False,hookafrog@gmail.com,35758,"{'id': 229, 'name': 'United States'}","{'id': 3596, 'name': 'Alabama'}"


In [77]:
# slice off extraneous info from country and state columns
# (assumes every country is id 229, United States, and every state id has four digits)

shops_df['country'] = shops_df['country'].str.replace("{'id': 229, 'name': '", '', regex = False)
shops_df['country'] = shops_df['country'].str.slice(0, -2)
shops_df['state'] = shops_df['state'].str.slice(22, -2)

shops_df.tail()

Unnamed: 0,address,city,id,latitude,location,longitude,name,pos_online,ravelry_retailer,shop_email,zip,country,state
2530,PO Box 731,Encampment,254,41.2095,"PO Box 731, Encampment, Wyoming",-106.789,Sheep Shed Studio,False,False,,82325,United States,Wyoming
2531,146 Coffeen Ave.,Sheridan,6352,44.7944,"146 Coffeen Ave., Sheridan, Wyoming",-106.954,The Fiber House,True,True,info@thefiberhouse.com,82801,United States,Wyoming
2532,Five G Lane,Lander,12484,42.7556,"Five G Lane, Lander, Wyoming",-110.998,The Knitting Dragon,False,False,,82520,United States,Wyoming
2533,"232 E 2nd St, Suite 103",Casper,13393,42.849,"232 E 2nd St, Suite 103 , Casper, Wyoming",-106.323,Windblown Fibers,False,True,windblownfibers@gmail.com,82601,United States,Wyoming
2534,10 Knoll Dr.,Laramie,6351,41.4214,"10 Knoll Dr., Laramie, Wyoming",-105.615,Woobee Knitshop,True,False,,82072,United States,Wyoming
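
The slicing above depends on the exact length of the id. A minimal sketch of a version that parses the string and reads the `'name'` key, so it would not care about the id. The helper name `dict_field` is hypothetical, and the Canada / Ontario ids are made up for illustration.

```python
import ast
import pandas as pd

def dict_field(text, key='name'):
    """Parse a dictionary-like string and return one field, or None if missing."""
    if not isinstance(text, str):
        return None
    return ast.literal_eval(text).get(key)

# Sample values; the second row's ids are invented.
shops_sketch = pd.DataFrame({
    'country': ["{'id': 229, 'name': 'United States'}", "{'id': 39, 'name': 'Canada'}"],
    'state': ["{'id': 3596, 'name': 'Alabama'}", "{'id': 12, 'name': 'Ontario'}"],
})
for col in ['country', 'state']:
    shops_sketch[col] = shops_sketch[col].map(dict_field)
```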


In [78]:
shops_df.to_csv('../data/df_shops_clean2.csv', index = False)

# Yarn data cleaning

In [122]:
yarns_df = pd.read_csv('../data/df_yarn_clean.csv')
yarns_df.head()

Unnamed: 0,discontinued,gauge_divisor,grams,machine_washable,max_gauge,min_gauge,name,rating_average,rating_count,rating_total,texture,yardage,yarn_company_name,yarn_weight_x,yarn_id,min_needle_size,max_needle_size,min_hook_size,max_hook_size,yarn_weight_y,yarn_fibers
0,False,4.0,198.0,True,,17.0,Super Saver Solids,3.56,16543,58835,cable plied,364.0,Red Heart,"{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...",2059,"{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_steel...",,"{'id': 9, 'us': '9 ', 'metric': 5.5, 'us_steel...",,"{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...","[{'id': 3430, 'percentage': 100, 'fiber_type':..."
1,False,4.0,170.0,True,,18.0,Simply Soft,4.02,18225,73301,plied,315.0,Caron,"{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...",3330,"{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_steel...",,"{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_steel...",,"{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...","[{'id': 11186, 'percentage': 100, 'fiber_type'..."
2,False,4.0,100.0,,20.0,18.0,Cascade 220®,4.48,20647,92463,plied,220.0,Cascade Yarns ®,"{'crochet_gauge': None, 'id': 12, 'knit_gauge'...",523,"{'id': 7, 'us': '7 ', 'metric': 4.5, 'us_steel...","{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_steel...","{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_steel...","{'id': 10, 'us': '10.0', 'metric': 6.0, 'us_st...","{'crochet_gauge': None, 'id': 12, 'knit_gauge'...","[{'id': 1, 'percentage': 100, 'fiber_type': {'..."
3,False,4.0,100.0,True,,16.0,Vanna's Choice,3.86,13681,52876,Plied,170.0,Lion Brand,"{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...",5741,"{'id': 9, 'us': '9 ', 'metric': 5.5, 'us_steel...",,"{'id': 10, 'us': '10 ', 'metric': 6.0, 'us_ste...",,"{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...","[{'id': 7358, 'percentage': 100, 'fiber_type':..."
4,False,4.0,100.0,,,18.0,Worsted,4.73,19847,93852,singles,210.0,Malabrigo Yarn,"{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...",1666,"{'id': 7, 'us': '7 ', 'metric': 4.5, 'us_steel...","{'id': 9, 'us': '9 ', 'metric': 5.5, 'us_steel...",,,"{'crochet_gauge': '', 'id': 1, 'knit_gauge': '...","[{'id': 9882, 'percentage': 100, 'fiber_type':..."


Cleaning needed:
- texture : clean up categories (multiple versions of plied etc)
- break out yarn_weight, also drop the extra one that made it through the previous step
- break out yarn_fibers

Save for later, if needed:
- break out min and max needle size, keep metric
- break out min and max hook size, keep metric

In [123]:
yarns_df = yarns_df.rename(columns = {'name' : 'yarn_name'})

In [124]:
yarns_df.texture.value_counts()

plied             3548
Plied             1320
plied              642
plied              282
singles            245
                  ... 
Metallic             1
rustic & wooly       1
shiny plied          1
Ladder Ribbon        1
Plied, tweed         1
Name: texture, Length: 890, dtype: int64

In [125]:
# trim white space
yarns_df['texture'] = yarns_df['texture'].str.strip()
yarns_df.texture.value_counts()

plied                        4548
Plied                        1427
singles                       253
plied, fluffy                  89
smooth                         82
                             ... 
lofty 2-ply                     1
One ply "Lopi" like             1
crimped                         1
Slubby, Handpsun                1
plied, mercerized, smooth       1
Name: texture, Length: 863, dtype: int64

Easier to do the rest of this step in Excel.
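
Some of the consolidation could still happen in pandas before the manual pass. A minimal sketch, assuming case and repeated spaces are the main sources of duplicate categories. The sample values are illustrative, and merging categories that differ in meaning would still need a hand-built mapping.

```python
import pandas as pd

# Illustrative sample of texture values, including a missing one.
texture = pd.Series(['plied', 'Plied', ' plied ', 'Plied,  tweed', 'singles', None])

texture_clean = (
    texture.str.strip()                              # trim outer whitespace
           .str.lower()                              # 'Plied' -> 'plied'
           .str.replace(r'\s+', ' ', regex=True)     # collapse repeated spaces
)
counts = texture_clean.value_counts()
```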

# Split faux-dictionary columns

In [126]:
# yarn_weight
yarn_weight_df = yarns_df.yarn_weight_x.str.split(", ", expand = True)
yarn_weight_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,{'crochet_gauge': '','id': 1,'knit_gauge': '18','max_gauge': None,'min_gauge': None,'name': 'Aran','ply': '10','wpi': '8'}
1,{'crochet_gauge': '','id': 1,'knit_gauge': '18','max_gauge': None,'min_gauge': None,'name': 'Aran','ply': '10','wpi': '8'}
2,{'crochet_gauge': None,'id': 12,'knit_gauge': '20','max_gauge': None,'min_gauge': None,'name': 'Worsted','ply': '10','wpi': '9'}
3,{'crochet_gauge': '','id': 1,'knit_gauge': '18','max_gauge': None,'min_gauge': None,'name': 'Aran','ply': '10','wpi': '8'}
4,{'crochet_gauge': '','id': 1,'knit_gauge': '18','max_gauge': None,'min_gauge': None,'name': 'Aran','ply': '10','wpi': '8'}
...,...,...,...,...,...,...,...,...
9995,{'crochet_gauge': None,'id': 11,'knit_gauge': '22','max_gauge': None,'min_gauge': None,'name': 'DK','ply': '8','wpi': '11'}
9996,{'crochet_gauge': None,'id': 10,'knit_gauge': '24-26','max_gauge': None,'min_gauge': None,'name': 'Sport','ply': '5','wpi': '12'}
9997,{'crochet_gauge': '','id': 5,'knit_gauge': '28','max_gauge': None,'min_gauge': None,'name': 'Fingering','ply': '4','wpi': '14'}
9998,{'crochet_gauge': '','id': 7,'knit_gauge': '32-34','max_gauge': None,'min_gauge': None,'name': 'Lace','ply': '2','wpi': None}


In [127]:
# rename columns I want to keep
yarn_weight_df = yarn_weight_df.rename(columns = {5 : 'yarn_weight', 7 : 'wpi'})

# remove label part of columns
yarn_weight_df['yarn_weight'] = yarn_weight_df['yarn_weight'].str.replace("'name': '", '', regex = False)
yarn_weight_df['wpi'] = yarn_weight_df['wpi'].str.replace("'wpi': '", '', regex = False)

# slice off trailing characters in columns
# (a wpi of None has no quotes, so those rows are left as "'wpi': Non")
yarn_weight_df['yarn_weight'] = yarn_weight_df['yarn_weight'].str.slice(0, -1)
yarn_weight_df['wpi'] = yarn_weight_df['wpi'].str.slice(0, -2)

# drop extra columns
yarn_weight_df = yarn_weight_df.drop(columns = [0, 1, 2, 3, 4, 6])
yarn_weight_df.head()


Unnamed: 0,yarn_weight,wpi
0,Aran,8
1,Aran,8
2,Worsted,9
3,Aran,8
4,Aran,8
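
The string slicing assumes every value is quoted, which does not hold when wpi is None. A minimal sketch of the same extraction done by parsing each cell, so None stays None. The two sample strings follow the structure of the rows shown above, and the sketch assumes the column has no missing cells.

```python
import ast
import pandas as pd

# Two sample cells in the same shape as the yarn_weight_x column.
raw = pd.Series([
    "{'crochet_gauge': '', 'id': 1, 'knit_gauge': '18', 'max_gauge': None, "
    "'min_gauge': None, 'name': 'Aran', 'ply': '10', 'wpi': '8'}",
    "{'crochet_gauge': '', 'id': 7, 'knit_gauge': '32-34', 'max_gauge': None, "
    "'min_gauge': None, 'name': 'Lace', 'ply': '2', 'wpi': None}",
])

parsed = raw.map(ast.literal_eval)  # each cell becomes a real dict
weight_sketch = pd.DataFrame({
    'yarn_weight': parsed.map(lambda d: d.get('name')),
    'wpi': parsed.map(lambda d: d.get('wpi')),
})
```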


In [128]:
# merge back to yarns_df on index
yarns_df = yarns_df.merge(yarn_weight_df, how = 'outer', left_index = True, right_index = True)
yarns_df.tail()

Unnamed: 0,discontinued,gauge_divisor,grams,machine_washable,max_gauge,min_gauge,yarn_name,rating_average,rating_count,rating_total,texture,yardage,yarn_company_name,yarn_weight_x,yarn_id,min_needle_size,max_needle_size,min_hook_size,max_hook_size,yarn_weight_y,yarn_fibers,yarn_weight,wpi
9995,True,4.0,100.0,True,,22.0,Big Baby Fair Isle 8 ply,3.98,41,163,plied,300.0,Patons Australia,"{'crochet_gauge': None, 'id': 11, 'knit_gauge'...",116193,"{'id': 6, 'us': '6 ', 'metric': 4.0, 'us_steel...",,,,"{'crochet_gauge': None, 'id': 11, 'knit_gauge'...","[{'id': 223285, 'percentage': 40, 'fiber_type'...",DK,11
9996,True,4.0,50.0,True,,24.0,Tiffany,3.92,53,208,eyelash,137.0,Lion Brand,"{'crochet_gauge': None, 'id': 10, 'knit_gauge'...",8295,"{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_steel...",,"{'id': 10, 'us': '10 ', 'metric': 6.0, 'us_ste...",,"{'crochet_gauge': None, 'id': 10, 'knit_gauge'...","[{'id': 9403, 'percentage': 100, 'fiber_type':...",Sport,12
9997,True,4.0,100.0,True,,28.0,Bungee,3.92,53,208,plied,403.0,Plymouth Yarn,"{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...",50956,"{'id': 3, 'us': '3 ', 'metric': 3.25, 'us_stee...",,,,"{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 61956, 'percentage': 5, 'fiber_type': ...",Fingering,14
9998,True,,454.0,False,,,Rick Rack II,3.84,49,188,textured,1400.0,Interlacements,"{'crochet_gauge': '', 'id': 7, 'knit_gauge': '...",6483,,,,,"{'crochet_gauge': '', 'id': 7, 'knit_gauge': '...","[{'id': 7932, 'percentage': 100, 'fiber_type':...",Lace,'wpi': Non
9999,False,4.0,100.0,True,28.0,28.0,Invicta Glamour,3.82,28,107,plied,376.0,Scheepjes,"{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...",103047,"{'id': 20, 'us': '2½', 'metric': 3.0, 'us_stee...","{'id': 20, 'us': '2½', 'metric': 3.0, 'us_stee...","{'id': 20, 'us': '2½', 'metric': 3.0, 'us_stee...","{'id': 20, 'us': '2½', 'metric': 3.0, 'us_stee...","{'crochet_gauge': '', 'id': 5, 'knit_gauge': '...","[{'id': 201090, 'percentage': 2, 'fiber_type':...",Fingering,14


In [129]:
# drop original columns
yarns_df = yarns_df.drop(columns = ['yarn_weight_x', 'yarn_weight_y'])
yarns_df.tail()

Unnamed: 0,discontinued,gauge_divisor,grams,machine_washable,max_gauge,min_gauge,yarn_name,rating_average,rating_count,rating_total,texture,yardage,yarn_company_name,yarn_id,min_needle_size,max_needle_size,min_hook_size,max_hook_size,yarn_fibers,yarn_weight,wpi
9995,True,4.0,100.0,True,,22.0,Big Baby Fair Isle 8 ply,3.98,41,163,plied,300.0,Patons Australia,116193,"{'id': 6, 'us': '6 ', 'metric': 4.0, 'us_steel...",,,,"[{'id': 223285, 'percentage': 40, 'fiber_type'...",DK,11
9996,True,4.0,50.0,True,,24.0,Tiffany,3.92,53,208,eyelash,137.0,Lion Brand,8295,"{'id': 8, 'us': '8 ', 'metric': 5.0, 'us_steel...",,"{'id': 10, 'us': '10 ', 'metric': 6.0, 'us_ste...",,"[{'id': 9403, 'percentage': 100, 'fiber_type':...",Sport,12
9997,True,4.0,100.0,True,,28.0,Bungee,3.92,53,208,plied,403.0,Plymouth Yarn,50956,"{'id': 3, 'us': '3 ', 'metric': 3.25, 'us_stee...",,,,"[{'id': 61956, 'percentage': 5, 'fiber_type': ...",Fingering,14
9998,True,,454.0,False,,,Rick Rack II,3.84,49,188,textured,1400.0,Interlacements,6483,,,,,"[{'id': 7932, 'percentage': 100, 'fiber_type':...",Lace,'wpi': Non
9999,False,4.0,100.0,True,28.0,28.0,Invicta Glamour,3.82,28,107,plied,376.0,Scheepjes,103047,"{'id': 20, 'us': '2½', 'metric': 3.0, 'us_stee...","{'id': 20, 'us': '2½', 'metric': 3.0, 'us_stee...","{'id': 20, 'us': '2½', 'metric': 3.0, 'us_stee...","{'id': 20, 'us': '2½', 'metric': 3.0, 'us_stee...","[{'id': 201090, 'percentage': 2, 'fiber_type':...",Fingering,14


In [130]:
# yarn_fibers
yarn_fibers_df = yarns_df.yarn_fibers.str.split(", ", expand = True)
yarn_fibers_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,...,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126
0,[{'id': 3430,'percentage': 100,'fiber_type': {'animal_fiber': False,'id': 5,'name': 'Acrylic','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 208,'name': 'Acrylic','permalink': 'acrylic','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,[{'id': 11186,'percentage': 100,'fiber_type': {'animal_fiber': False,'id': 5,'name': 'Acrylic','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 208,'name': 'Acrylic','permalink': 'acrylic','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,[{'id': 1,'percentage': 100,'fiber_type': {'animal_fiber': True,'id': 3,'name': 'Wool','synthetic': False,'vegetable_fiber': False},'fiber_category': {'id': 1,'name': 'Wool','permalink': 'wool'}}],,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,[{'id': 7358,'percentage': 100,'fiber_type': {'animal_fiber': False,'id': 5,'name': 'Acrylic','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 208,'name': 'Acrylic','permalink': 'acrylic','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,[{'id': 9882,'percentage': 100,'fiber_type': {'animal_fiber': True,'id': 24,'name': 'Merino','synthetic': False,'vegetable_fiber': False},'fiber_category': {'id': 22,'name': 'Merino','permalink': 'merino','parent': {'id': 1,'name': 'Wool','permalink': 'wool'}}}],,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [131]:
# limit to info on top 3 fibers per yarn
yarn_fibers_df[39].value_counts()

'id': 5                      27
'id': 3                      25
'vegetable_fiber': False}    24
'id': 24                     23
'id': 1                      19
                             ..
'id': 12                      1
{'id': 8658                   1
'permalink': 'viscose'        1
{'id': 39823                  1
{'id': 186032                 1
Name: 39, Length: 75, dtype: int64

In [132]:
# drop extra columns from the end: keep the first 39 (3 fibers x up to 13 attributes)
yarn_fibers_df = yarn_fibers_df.iloc[:, :39]
yarn_fibers_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38
0,[{'id': 3430,'percentage': 100,'fiber_type': {'animal_fiber': False,'id': 5,'name': 'Acrylic','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 208,'name': 'Acrylic','permalink': 'acrylic','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,,,,,,,,,,,,,,
1,[{'id': 11186,'percentage': 100,'fiber_type': {'animal_fiber': False,'id': 5,'name': 'Acrylic','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 208,'name': 'Acrylic','permalink': 'acrylic','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,,,,,,,,,,,,,,
2,[{'id': 1,'percentage': 100,'fiber_type': {'animal_fiber': True,'id': 3,'name': 'Wool','synthetic': False,'vegetable_fiber': False},'fiber_category': {'id': 1,'name': 'Wool','permalink': 'wool'}}],,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,[{'id': 7358,'percentage': 100,'fiber_type': {'animal_fiber': False,'id': 5,'name': 'Acrylic','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 208,'name': 'Acrylic','permalink': 'acrylic','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,,,,,,,,,,,,,,
4,[{'id': 9882,'percentage': 100,'fiber_type': {'animal_fiber': True,'id': 24,'name': 'Merino','synthetic': False,'vegetable_fiber': False},'fiber_category': {'id': 22,'name': 'Merino','permalink': 'merino','parent': {'id': 1,'name': 'Wool','permalink': 'wool'}}}],,,,,,,,,,,,,,,,,,,,,,,,,,


In [133]:
yarn_fibers_df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38
9995,[{'id': 223285,'percentage': 40,'fiber_type': {'animal_fiber': False,'id': 2,'name': 'Nylon','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 216,'name': 'Nylon / Polyamide','permalink': 'nylon','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}},{'id': 223284,'percentage': 60,'fiber_type': {'animal_fiber': False,'id': 5,'name': 'Acrylic','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 208,'name': 'Acrylic','permalink': 'acrylic','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,
9996,[{'id': 9403,'percentage': 100,'fiber_type': {'animal_fiber': False,'id': 2,'name': 'Nylon','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 216,'name': 'Nylon / Polyamide','permalink': 'nylon','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,,,,,,,,,,,,,,
9997,[{'id': 61956,'percentage': 5,'fiber_type': {'animal_fiber': False,'id': 21,'name': 'Other','synthetic': False,'vegetable_fiber': False},'fiber_category': {'id': 239,'name': 'Other','permalink': 'other'}},{'id': 61955,'percentage': 95,'fiber_type': {'animal_fiber': True,'id': 3,'name': 'Wool','synthetic': False,'vegetable_fiber': False},'fiber_category': {'id': 1,'name': 'Wool','permalink': 'wool'}}],,,,,,,,,,,,,,,,,,,
9998,[{'id': 7932,'percentage': 100,'fiber_type': {'animal_fiber': False,'id': 4,'name': 'Rayon','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 218,'name': 'Rayon / Viscose','permalink': 'viscose','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}}],,,,,,,,,,,,,,,,,,,,,,,,,,
9999,[{'id': 201090,'percentage': 2,'fiber_type': {'animal_fiber': False,'id': 6,'name': 'Polyester','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 217,'name': 'Polyester','permalink': 'polyester','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}},{'id': 201089,'percentage': 24,'fiber_type': {'animal_fiber': False,'id': 2,'name': 'Nylon','synthetic': True,'vegetable_fiber': False},'fiber_category': {'id': 216,'name': 'Nylon / Polyamide','permalink': 'nylon','parent': {'id': 207,'name': 'Manufactured Fibers','permalink': 'manufactured-fibers'}}},{'id': 201088,'percentage': 74,'fiber_type': {'animal_fiber': True,'id': 3,'name': 'Wool','synthetic': False,'vegetable_fiber': False},'fiber_category': {'id': 1,'name': 'Wool','permalink': 'wool'}}],,,


In [134]:
yarn_fibers_df[21].value_counts()

'name': 'Wool'                   1441
'name': 'Merino'                  856
'name': 'Manufactured Fibers'     659
'name': 'Acrylic'                 373
'name': 'Nylon / Polyamide'       354
                                 ... 
'percentage': 34                    1
'percentage': 57                    1
'name': 'Hebridean'                 1
'percentage': 90                    1
'name': 'Wensleydale'               1
Name: 21, Length: 98, dtype: int64

Getting info out of yarn_fibers isn't going to work without turning it into a real dictionary first. (Each fiber in a yarn can have up to 13 attributes, but not all fibers have the same number, so simply splitting into columns won't work.) I tried the code that worked for the craft column in patterns_df, but it gives the same error as it did with the other columns.

yarns_df['yarn_fibers'] = yarns_df['yarn_fibers'].apply(lambda x : dict(eval(x)))
temp = yarns_df['yarn_fibers'].apply(pd.Series)
yarns_df = pd.concat([yarns_df, temp], axis = 1).drop('yarn_fibers', axis = 1)
yarns_df.head()

Found this solution on Stack Overflow (https://stackoverflow.com/questions/38231591/splitting-dictionary-list-inside-a-pandas-column-into-separate-columns), but it hasn't worked either.

yarn_fiber_df = pd.json_normalize(yarns_df['yarn_fibers'])

I thought json_normalize might work if the column were already a true dictionary (nope), but converting the column is the step at which every approach goes wrong.

Tried turning the dictionary into a dataframe, but the arrays are different lengths.

In [135]:
yarns_df['yarn_fibers'].loc[2]

"[{'id': 1, 'percentage': 100, 'fiber_type': {'animal_fiber': True, 'id': 3, 'name': 'Wool', 'synthetic': False, 'vegetable_fiber': False}, 'fiber_category': {'id': 1, 'name': 'Wool', 'permalink': 'wool'}}]"

In [136]:
ast.literal_eval(yarns_df['yarn_fibers'].loc[2])

[{'id': 1,
  'percentage': 100,
  'fiber_type': {'animal_fiber': True,
   'id': 3,
   'name': 'Wool',
   'synthetic': False,
   'vegetable_fiber': False},
  'fiber_category': {'id': 1, 'name': 'Wool', 'permalink': 'wool'}}]

In [137]:
type(ast.literal_eval(yarns_df['yarn_fibers'].loc[2]))

list
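
Since the parsed value is a list of dictionaries rather than a single dictionary, that may be why `dict(eval(x))` and `json_normalize` on the raw strings both failed. A minimal sketch for a single cell, using the row 2 value shown above: once parsed, the list can be handed to `pd.json_normalize`, which flattens the nested dictionaries into dotted column names.

```python
import ast
import pandas as pd

# The yarn_fibers value for row 2, copied from the output above.
cell = ("[{'id': 1, 'percentage': 100, 'fiber_type': {'animal_fiber': True, 'id': 3, "
        "'name': 'Wool', 'synthetic': False, 'vegetable_fiber': False}, "
        "'fiber_category': {'id': 1, 'name': 'Wool', 'permalink': 'wool'}}]")

fibers = ast.literal_eval(cell)        # a list with one dict per fiber
one_yarn = pd.json_normalize(fibers)   # nested keys become e.g. 'fiber_type.name'
```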

In [138]:
#thought this might work, but it doesn't do quite what I want

#yarns_df['yarn_fibers'] = yarns_df['yarn_fibers'].apply(lambda x : ast.literal_eval(x))
#temp = yarns_df['yarn_fibers'].apply(pd.Series)
#yarns_df = pd.concat([yarns_df, temp], axis = 1)#.drop('yarn_fibers', axis = 1)
#yarns_df.head()

In [139]:
yarns_df['yarn_fibers'].apply(lambda x : len(ast.literal_eval(x))).value_counts()
# each list element is one fiber, so this counts how many fibers are listed for each yarn

1     4415
2     3933
3     1401
4      215
5       22
0       11
7        1
10       1
9        1
Name: yarn_fibers, dtype: int64

In [140]:
yarn_dict = yarns_df['yarn_fibers'].apply(lambda x : ast.literal_eval(x)[0] if len(ast.literal_eval(x)) > 0 else {})
# properly turns each row into a dictionary
# the [0] pulls only the first fiber listed. How to change this to pull each yarn fiber component in turn?
# I think this is what Mahesh was talking about when he recommended a function.
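
One possible way to pull every fiber rather than only the first is to build a long table with one row per (yarn, fiber) pair. A minimal sketch, assuming a `yarn_id` column is available to tie fibers back to their yarn. The fiber dictionaries are shortened, and the third yarn (id 1, no fibers) is made up to show the empty case.

```python
import ast
import pandas as pd

yarns_sketch = pd.DataFrame({
    'yarn_id': [523, 50956, 1],
    'yarn_fibers': [
        "[{'id': 1, 'percentage': 100, 'fiber_type': {'id': 3, 'name': 'Wool'}}]",
        "[{'id': 61956, 'percentage': 5, 'fiber_type': {'id': 21, 'name': 'Other'}}, "
        "{'id': 61955, 'percentage': 95, 'fiber_type': {'id': 3, 'name': 'Wool'}}]",
        "[]",
    ],
})

# One record per (yarn, fiber) pair; yarns with no fibers contribute no rows.
records = [
    {'yarn_id': yarn_id, **fiber}
    for yarn_id, text in zip(yarns_sketch['yarn_id'], yarns_sketch['yarn_fibers'])
    for fiber in ast.literal_eval(text)
]
fibers_long = pd.json_normalize(records)
```

A long table like this avoids choosing a fixed number of fibers per yarn, at the cost of needing a groupby or pivot to get back to one row per yarn.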

In [141]:
yarnfiber_df = yarn_dict.apply(pd.Series)
# takes each dictionary element and puts it in a column

In [142]:
yarnfiber_df
# fiber_type and fiber_category are both dictionaries so they need further steps

Unnamed: 0,id,percentage,fiber_type,fiber_category
0,3430.0,100.0,"{'animal_fiber': False, 'id': 5, 'name': 'Acry...","{'id': 208, 'name': 'Acrylic', 'permalink': 'a..."
1,11186.0,100.0,"{'animal_fiber': False, 'id': 5, 'name': 'Acry...","{'id': 208, 'name': 'Acrylic', 'permalink': 'a..."
2,1.0,100.0,"{'animal_fiber': True, 'id': 3, 'name': 'Wool'...","{'id': 1, 'name': 'Wool', 'permalink': 'wool'}"
3,7358.0,100.0,"{'animal_fiber': False, 'id': 5, 'name': 'Acry...","{'id': 208, 'name': 'Acrylic', 'permalink': 'a..."
4,9882.0,100.0,"{'animal_fiber': True, 'id': 24, 'name': 'Meri...","{'id': 22, 'name': 'Merino', 'permalink': 'mer..."
...,...,...,...,...
9995,223285.0,40.0,"{'animal_fiber': False, 'id': 2, 'name': 'Nylo...","{'id': 216, 'name': 'Nylon / Polyamide', 'perm..."
9996,9403.0,100.0,"{'animal_fiber': False, 'id': 2, 'name': 'Nylo...","{'id': 216, 'name': 'Nylon / Polyamide', 'perm..."
9997,61956.0,5.0,"{'animal_fiber': False, 'id': 21, 'name': 'Oth...","{'id': 239, 'name': 'Other', 'permalink': 'oth..."
9998,7932.0,100.0,"{'animal_fiber': False, 'id': 4, 'name': 'Rayo...","{'id': 218, 'name': 'Rayon / Viscose', 'permal..."


In [143]:
#apply_test2 = temp['fiber_type'].apply(lambda x : ast.literal_eval(x) if len(ast.literal_eval(x)) > 0 else {})
yarnfiber_type_df = yarnfiber_df['fiber_type'].apply(pd.Series)
yarnfiber_type_df
# repeat .apply to break out formerly nested dictionaries

Unnamed: 0,0,animal_fiber,id,name,synthetic,vegetable_fiber
0,,False,5.0,Acrylic,True,False
1,,False,5.0,Acrylic,True,False
2,,True,3.0,Wool,False,False
3,,False,5.0,Acrylic,True,False
4,,True,24.0,Merino,False,False
...,...,...,...,...,...,...
9995,,False,2.0,Nylon,True,False
9996,,False,2.0,Nylon,True,False
9997,,False,21.0,Other,False,False
9998,,False,4.0,Rayon,True,False


Mahesh recommends writing a function that turns the problem columns into dictionaries and then breaks them into their constituent parts, and using that function in place of the lambda.

What are the steps?
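
A rough answer, as a sketch rather than a settled design: (1) parse the cell if it is still a string, (2) treat anything that is not a dictionary as empty, (3) build one column per key, (4) prefix the new columns and join them back in place of the original. The function name `expand_dict_column` and the prefix are hypothetical.

```python
import ast
import pandas as pd

def expand_dict_column(df, column, prefix):
    """Replace a column of dicts (or dict-like strings) with one column per key."""
    def to_dict(value):
        if isinstance(value, str):
            value = ast.literal_eval(value)           # step 1: parse strings
        return value if isinstance(value, dict) else {}  # step 2: empty fallback

    # step 3: a list of dicts becomes one column per key (missing keys -> NaN)
    expanded = pd.DataFrame(df[column].map(to_dict).tolist(), index=df.index)
    # step 4: prefix to avoid name clashes, then swap in for the original column
    return df.drop(columns=[column]).join(expanded.add_prefix(prefix))

# Small made-up example: one parsable cell and one missing cell.
demo = pd.DataFrame({
    'percentage': [100, 5],
    'fiber_type': ["{'id': 3, 'name': 'Wool'}", None],
})
flat = expand_dict_column(demo, 'fiber_type', 'fiber_type_')
```

Building the frame from a list of dictionaries, rather than `.apply(pd.Series)`, should also avoid the stray `0` column that shows up in the outputs here, which appears to come from the empty rows.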


In [144]:
yarnfiber_category_df = yarnfiber_df['fiber_category'].apply(pd.Series)
yarnfiber_category_df

Unnamed: 0,0,id,name,parent,permalink
0,,208.0,Acrylic,"{'id': 207, 'name': 'Manufactured Fibers', 'pe...",acrylic
1,,208.0,Acrylic,"{'id': 207, 'name': 'Manufactured Fibers', 'pe...",acrylic
2,,1.0,Wool,,wool
3,,208.0,Acrylic,"{'id': 207, 'name': 'Manufactured Fibers', 'pe...",acrylic
4,,22.0,Merino,"{'id': 1, 'name': 'Wool', 'permalink': 'wool'}",merino
...,...,...,...,...,...
9995,,216.0,Nylon / Polyamide,"{'id': 207, 'name': 'Manufactured Fibers', 'pe...",nylon
9996,,216.0,Nylon / Polyamide,"{'id': 207, 'name': 'Manufactured Fibers', 'pe...",nylon
9997,,239.0,Other,,other
9998,,218.0,Rayon / Viscose,"{'id': 207, 'name': 'Manufactured Fibers', 'pe...",viscose


In [145]:
yarnfiber_parent_df = yarnfiber_category_df['parent'].apply(pd.Series)
yarnfiber_parent_df

Unnamed: 0,0,id,name,permalink
0,,207.0,Manufactured Fibers,manufactured-fibers
1,,207.0,Manufactured Fibers,manufactured-fibers
2,,,,
3,,207.0,Manufactured Fibers,manufactured-fibers
4,,1.0,Wool,wool
...,...,...,...,...
9995,,207.0,Manufactured Fibers,manufactured-fibers
9996,,207.0,Manufactured Fibers,manufactured-fibers
9997,,,,
9998,,207.0,Manufactured Fibers,manufactured-fibers
