# SERPgen training data cleaning and preparation

This Jupyter Notebook walks through cleaning and preparing training data for the K2T model, a variant of the T5 model. In this example, the model is trained to generate a sequence (a sentence) from keywords entered by the user, so the training data must have the corresponding format (X = keywords, y = sentence).

The two datasets used here (Google search results/SERP text and e-commerce product descriptions) already provide well-written sentences (copy/marketing lingo), but the X data (keywords) for each y must be generated first. For this we use the KeyBERT library and tune its parameters so that it delivers keywords close to what a user would enter.

Before that, however, we select just 1-2 sentences from each text, ideally sentences that point out the product/service attributes and/or contain a call to action (CTA). For this sentence-filtering step, we use the spaCy NLP library.

## Imports

In [25]:
#!pip install spacy
#!python -m spacy download en_core_web_lg
#!pip install spacy-transformers
#!pip install keybert 

from keybert import KeyBERT
import spacy
import pandas as pd
import numpy as np

## Model testing (keyword extraction)

In [2]:
model = KeyBERT(model="distilbert-base-nli-mean-tokens")

### Sample Input

In [59]:
doc1 = "Yoga mats (also called sticky mats) are used in most yoga classes to provide cushioning and traction. While you can usually rent a mat at a ..."
doc2 = "Plan for retirement, learn how to invest, and more. Access our investor education resources to get started or further develop your investing and trading strategies."
doc3 = "Computer Desk With Hutch And Bookshelf, 47 Inches Home Office Desk With Space Saving Design For Small Spaces (Dark Walnut) Inbox Zero Color"
doc4 = "The California Department of Education provides leadership, assistance, oversight and resources so that every Californian has access to an education. Another sentence starts here ... May I introduce a very long sentence that will hopefully be detected by the function because it starts with a verb but has a question mark at the end and is artificially made longer so that it is the longest sentence?"
doc5 = "This is a list of sentences. Do some have rhetorical questions? I don't know... This is a sentence about yoga mats though! Try out now this super long sentence which hopefully gets sorted out because it is the longest one."
doc6 = "We are a nonprofit, non-partisan communications organization dedicated to building support for student-focused improvements in public education"
doc7 = "Agame.com is the best place to go if you're searching for a variety of popular games to play online. New free games are always being added, too!"

### Sample Output

#### Here you can test different parameters and inspect the generated keywords

In [8]:
output1 = model.extract_keywords(doc1, top_n=10, keyphrase_ngram_range=(1, 2), use_maxsum = False, stop_words="english", use_mmr=False, diversity=0.9)
output1

[('yoga mats', 0.5897),
 ('yoga classes', 0.5886),
 ('used yoga', 0.4376),
 ('yoga', 0.4116),
 ('sticky mats', 0.3777),
 ('mats used', 0.3537),
 ('cushioning traction', 0.2746),
 ('classes provide', 0.2717),
 ('provide cushioning', 0.2704),
 ('traction usually', 0.2512)]

In [None]:
output2 = model.extract_keywords(doc2, top_n=10, keyphrase_ngram_range=(1, 2), stop_words="english")
output2

[('investor education', 0.5795),
 ('develop investing', 0.5221),
 ('investing trading', 0.5068),
 ('plan retirement', 0.4869),
 ('retirement learn', 0.4866),
 ('trading strategies', 0.4255),
 ('learn invest', 0.4131),
 ('investing', 0.382),
 ('access investor', 0.3806),
 ('education resources', 0.3363)]

In [None]:
output3 = model.extract_keywords(doc3, top_n=10, keyphrase_ngram_range=(1, 2), stop_words="english")
output3

[('computer desk', 0.5546),
 ('office desk', 0.5493),
 ('desk hutch', 0.4791),
 ('desk space', 0.4788),
 ('desk', 0.4281),
 ('walnut inbox', 0.4236),
 ('dark walnut', 0.4195),
 ('hutch bookshelf', 0.3808),
 ('bookshelf', 0.3786),
 ('bookshelf 47', 0.378)]

In [None]:
output4 = model.extract_keywords(doc4, top_n=10, keyphrase_ngram_range=(1, 2), use_maxsum = False, stop_words="english", use_mmr=True, diversity=0.8)

In [None]:
output4

[('resources californian', 0.3284),
 ('longest sentence', 0.2843),
 ('education provides', 0.1165),
 ('verb question', -0.0048),
 ('function starts', 0.1336),
 ('mark end', -0.0819),
 ('hopefully detected', 0.0948),
 ('oversight resources', -0.0833),
 ('artificially', -0.1147),
 ('artificially longer', 0.051)]
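
The `use_mmr` and `diversity` arguments used above trade relevance to the document against redundancy among the selected phrases. The block below is a minimal, library-free sketch of the general maximal marginal relevance (MMR) idea, not KeyBERT's actual implementation. The function name `mmr_select`, the scoring formula as written here, and the toy 2-D vectors are assumptions made purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(doc_vec, candidates, top_n=2, diversity=0.5):
    """Pick phrases that are relevant to the document but not redundant.

    candidates maps phrase -> embedding (hypothetical toy vectors here).
    """
    relevance = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    # The first pick is simply the phrase most similar to the document.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        best, best_score = None, -math.inf
        for phrase, vec in candidates.items():
            if phrase in selected:
                continue
            # Redundancy: similarity to the closest already-selected phrase.
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[phrase] - diversity * redundancy
            if score > best_score:
                best, best_score = phrase, score
        selected.append(best)
    return selected

doc_vec = (1.0, 0.0)
toy_candidates = {
    "yoga mats": (1.0, 0.1),
    "sticky mats": (1.0, 0.2),          # nearly parallel to "yoga mats"
    "cushioning traction": (0.6, 0.8),  # less relevant, but different
}

low_diversity = mmr_select(doc_vec, toy_candidates, top_n=2, diversity=0.0)
high_diversity = mmr_select(doc_vec, toy_candidates, top_n=2, diversity=0.7)
print(low_diversity)   # ['yoga mats', 'sticky mats']
print(high_diversity)  # ['yoga mats', 'cushioning traction']
```

With `diversity=0.0` the second pick is the near-duplicate phrase; with a higher value the less similar phrase wins, which is the kind of effect to look for when comparing the outputs above.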

## Keybert & spaCy to extract most important keywords

In [4]:
#spacy.prefer_gpu()   #uncomment to run on GPU; in that case you may want to load the transformer model "en_core_web_trf" instead (slow without a GPU)

nlp = spacy.load("en_core_web_lg", exclude=['parser', 'attribute_ruler', 'lemmatizer'])  #tagger & ner are kept
nlp.add_pipe('sentencizer')   #rule-based sentence splitting replaces the excluded parser

kw_model = KeyBERT(model=nlp)   #Use KeyBERT based on spaCy model

In [10]:
#Showing the fine-grained POS (part of speech) tag for each token in the NLP'ed text
tagged_tokens = [(token, token.tag_) for token in nlp(doc7)]
print(tagged_tokens)

[(Agame.com, 'NNP'), (is, 'VBZ'), (the, 'DT'), (best, 'JJS'), (place, 'NN'), (to, 'TO'), (go, 'VB'), (if, 'IN'), (you, 'PRP'), ('re, 'VBP'), (searching, 'VBG'), (for, 'IN'), (a, 'DT'), (variety, 'NN'), (of, 'IN'), (popular, 'JJ'), (games, 'NNS'), (to, 'TO'), (play, 'VB'), (online, 'RB'), (., '.'), (New, 'JJ'), (free, 'JJ'), (games, 'NNS'), (are, 'VBP'), (always, 'RB'), (being, 'VBG'), (added, 'VBN'), (,, ','), (too, 'RB'), (!, '.')]


In [11]:
#Detecting named entities in the sentence
for word in nlp(doc7).ents:    #ents = entities
    print(word.text,word.label_)

Agame.com ORG


In [46]:
#optional helper called by get_keywords: detects and removes entities such as website/company names, prices, ordinals or percentages (e.g. discounts)
def remove_ner(text):
    doc = nlp(text)
    #merge each multi-token entity into a single token so it can be dropped as a whole
    with doc.retokenize() as retokenizer:
        for e in doc.ents:
            retokenizer.merge(doc[e.start:e.end])
    #keep each token together with its trailing whitespace so the text can be rebuilt
    tok_pairs = [(tok.text, tok.whitespace_) for tok in doc]
    ents = [e.text for e in doc.ents if e.label_ in ("ORG", "MONEY", "ORDINAL", "PERCENT")]
    print(f"Entities to be removed from sentence: {ents}")
    tok_pairs_out = [pair for pair in tok_pairs if pair[0] not in ents]
    return "".join(tok_text + whitespace for tok_text, whitespace in tok_pairs_out)

In [47]:
remove_ner(doc7)

Entities to be removed from sentence: ['Agame.com']


"is the best place to go if you're searching for a variety of popular games to play online. New free games are always being added, too!"

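`remove_ner` relies on each token being stored together with its trailing whitespace, so the text can be reassembled after some tokens are dropped. A minimal sketch of that reassembly without spaCy follows. The hand-written `(text, whitespace)` pairs only mimic what `tok.text` and `tok.whitespace_` would provide, and the helper name `drop_tokens` is hypothetical.

```python
def drop_tokens(tok_pairs, unwanted):
    """Rebuild a string from (text, trailing_whitespace) pairs, skipping unwanted tokens."""
    kept = [(text, ws) for text, ws in tok_pairs if text not in unwanted]
    return "".join(text + ws for text, ws in kept)

# Hand-built pairs for a shortened version of doc7 (assumed tokenization).
sample_pairs = [
    ("Agame.com", " "),
    ("is", " "),
    ("the", " "),
    ("best", " "),
    ("place", ""),
    (".", ""),
]

rebuilt = drop_tokens(sample_pairs, {"Agame.com"})
print(rebuilt)  # is the best place.
```

Because the whitespace travels with the token it follows, removing a token also removes its trailing space, and no double spaces are left behind.
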
In [53]:
import itertools

def get_keywords(text):  
    text = remove_ner(text)
    
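    #use KeyBERT (with the spaCy model) to extract the top 5 keyphrases (uni- and bi-grams), excluding stop words (which would be of no use as keywords)
    #note: KeyBERT documents diversity as a value between 0 and 1; the outputs shown in this notebook were produced with diversity=2
    keywords = kw_model.extract_keywords(text, top_n=5, keyphrase_ngram_range=(1, 2), use_maxsum=False, stop_words="english", use_mmr=True, diversity=2)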
   
    #order output based on score, merge to one list
    kw_sorted = sorted(keywords, key=lambda tup:(-tup[1], tup[0]))
    kw_list = [tup[0].split(" ") for tup in kw_sorted]
    kw_list2 = list(itertools.chain.from_iterable(kw_list))
    
    #iterate through list and add only those not in the list yet (unique words only)
    output_list = []
    for word in kw_list2:
        if word not in output_list:
            output_list.append(word)
    return (" ".join(output_list[:5]))

In [54]:
#test output
get_keywords(doc7)


Entities to be removed from sentence: ['Agame.com']


'best place free games searching'
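
The post-processing inside `get_keywords` (sort by score, split phrases into words, keep the first occurrence of each word, cut to five) can be tried without loading any model. The sketch below uses a hypothetical helper `merge_keywords` and swaps the explicit loop for `dict.fromkeys`, which also preserves first-seen order. The input is the top of `output1` from the model test above.

```python
from itertools import chain

def merge_keywords(scored_phrases, max_words=5):
    """Turn [(phrase, score), ...] into one string of unique words, best phrases first."""
    # Highest score first; ties are broken alphabetically by phrase.
    ranked = sorted(scored_phrases, key=lambda pair: (-pair[1], pair[0]))
    words = chain.from_iterable(phrase.split(" ") for phrase, _ in ranked)
    # dict.fromkeys drops repeated words while keeping first-seen order.
    unique_words = list(dict.fromkeys(words))
    return " ".join(unique_words[:max_words])

sample_keywords = [
    ("yoga mats", 0.5897),
    ("yoga classes", 0.5886),
    ("used yoga", 0.4376),
    ("yoga", 0.4116),
    ("sticky mats", 0.3777),
]

merged = merge_keywords(sample_keywords)
print(merged)  # yoga mats classes used sticky
```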

## Filtering out just 1 sentence with NLP, cleaning it with RegEx

In [55]:
import re

#examples of elements at the head of Google SERP text that need to be removed (here: dates)
TestString = "Sep 26, 2019 - Interactive Brokers will offer unlimited free trades on stocks and ETFs via IBKR Lite."
TestString2 = "23 hours ago - The Starz drama just aired its season 6 mid-season finale, and fans on Twitter have made it one of the most engaged with shows on TV."

#function that detects a date at the head of a snippet and removes it
def remove_date(string):
    #raw string so that \s is passed to the regex engine unchanged
    pattern = r'[a-zA-Z]+\s(0?[1-9]|[12][0-9]|3[01]),\s+[0-9]+\s+-\s|[0-9]+ [a-zA-Z]+\s[a-zA-Z]+\s+-\s'
    mod_string = re.sub(pattern, '', string)
    return mod_string
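
The pattern in `remove_date` is not anchored, so it can also match in the middle of a snippet. The sketch below is one possible stricter variant, assuming the intent is to strip a date only at the very start of the string. The names `DATE_PREFIX` and `remove_date_prefix` are hypothetical, and the third test sentence is made up to show the difference.

```python
import re

# Same two alternatives as in remove_date, wrapped in a group and anchored with ^.
DATE_PREFIX = re.compile(
    r"^(?:[a-zA-Z]+\s(?:0?[1-9]|[12][0-9]|3[01]),\s+[0-9]+\s+-\s"
    r"|[0-9]+ [a-zA-Z]+\s[a-zA-Z]+\s+-\s)"
)

def remove_date_prefix(snippet):
    """Strip an absolute ('Sep 26, 2019 - ') or relative ('23 hours ago - ') date at the start."""
    return DATE_PREFIX.sub("", snippet, count=1)

absolute = remove_date_prefix(
    "Sep 26, 2019 - Interactive Brokers will offer unlimited free trades on stocks and ETFs via IBKR Lite."
)
relative = remove_date_prefix(
    "23 hours ago - The Starz drama just aired its season 6 mid-season finale."
)
untouched = remove_date_prefix("Save 10 percent today - only at our store.")

print(absolute)
print(relative)
print(untouched)
```

The first two snippets lose their date prefix, while the third is returned unchanged because "10 percent today - " does not sit at the start of the string.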

In [82]:
def choose_best_sentence(text):
    #Keeps sentences that do not start with a base-form verb (imperative) and do not end with a question mark (rhetorical question)
    text = str(text)
    nlp_phrases = [sent for sent in nlp(text).sents if sent[0].tag_ != 'VB' and not sent.text.endswith("?")]
    
    #optional: also drop sentences cut off mid-sentence by Google (...)
    #nlp_phrases = [sent for sent in nlp_phrases if "..." not in sent.text]
        
    #finds the longest remaining sentence (measured in tokens)
    if nlp_phrases:
        longest_sent = max(nlp_phrases, key=len)
        #returns a str object with any leading date removed
        return remove_date(longest_sent.text)
    else:
        return "NaN"

In [83]:
test4 = choose_best_sentence(doc4)
test4

'The California Department of Education provides leadership, assistance, oversight and resources so that every Californian has access to an education.'
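
The selection rule in `choose_best_sentence` can be separated from spaCy to make it easier to reason about. In the sketch below the sentences are modelled on `doc5`, the tag of each first token is assigned by hand (an assumption, not spaCy output), and length is approximated by whitespace-separated words rather than spaCy tokens. The helper name `pick_sentence` is hypothetical.

```python
def pick_sentence(tagged_sentences):
    """tagged_sentences: list of (sentence_text, tag_of_first_token) tuples."""
    candidates = [
        text
        for text, first_tag in tagged_sentences
        # Drop imperatives (base-form verb first) and questions.
        if first_tag != "VB" and not text.endswith("?")
    ]
    if not candidates:
        return "NaN"
    # Longest remaining sentence, counted in whitespace-separated words.
    return max(candidates, key=lambda text: len(text.split()))

tagged_doc5 = [
    ("This is a list of sentences.", "DT"),
    ("Do some have rhetorical questions?", "VBP"),
    ("I don't know...", "PRP"),
    ("This is a sentence about yoga mats though!", "DT"),
    ("Try out now this super long sentence which hopefully gets sorted out "
     "because it is the longest one.", "VB"),
]

chosen = pick_sentence(tagged_doc5)
print(chosen)  # This is a sentence about yoga mats though!
```

The longest sentence overall is rejected because it starts with a base-form verb, so the longest of the remaining statements is returned.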

## Import dataset and apply cleaning functions

In [84]:
df = pd.read_csv("../../final project/data/serp data/SERP_dataset Keggle/SERP_dataset.csv")

In [None]:
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 500)

In [None]:
df["keyword"].value_counts()  #show different keyword categories and number of sentences for each

File Sharing and Storage              102
Proxy Avoidance                       102
Illegal or Unethical                  102
Extremist Groups                      102
Spam URLs                             101
Newsgroups and Message Boards         101
Web-based Applications                101
Domain Parking                        101
Web Chat                              101
Personal Privacy                      101
Remote Access                         101
Explicit Violence                     101
Newly Observed Domain                 101
Dynamic DNS                           101
Malicious Websites                    101
Dynamic Content                       100
Child Education                       100
Other Adult Materials                 100
Nudity and Risque                     100
Content Servers                       100
Tobacco                               100
Finance and Banking                   100
Secure Websites                       100
Society and Lifestyles            

In [None]:
df[["keyword", "title", "description"]].sample(frac=1).groupby('keyword', sort=False).head(5)

Unnamed: 0,keyword,title,description
6822,Discrimination,"For Discrimination: Race, Affirmative Action, and the Law ...",For Discrimination is at once the definitive reckoning with one of America's most explosively contentious and divisive issues and a principled work of advocacy ...
2037,Lingerie and Swimsuit,The Mom Edit's 2019 Swimsuit Guide,"The coolest bathing suits (plunge, monokinis, cut-outs) — for all body types (even hiding a post-baby belly), are here in our yearly swimwear roundup."
7100,Extremist Groups,What Is a Political Extremist? - ThoughtCo,"Jul 2, 2019 - One prominent environmental extremist group has described its mission as using “economic sabotage and guerrilla warfare to stop the ..."
7555,Dynamic DNS,Free Dynamic DNS - No-IP.com - Managed DNS Services,Free Dynamic DNS service provider since 1999. Point your dynamic IP address to a static hostname for free. Sign up today!
6331,Sports,Sports | The Seattle Times,"The Seattle Times Sports section covers teams and athletes in the Seattle area and Pacific Northwest, including the Seahawks, Mariners, UW Huskies, WSU ..."
5147,Child Education,Save Receipts For Your Child's Education Expenses,Minnesota has two programs to help you pay for your child's education expenses. ... and the K–12 Education Credit can lower the tax you pay or increase your ...
4095,Online Meeting,Find an Al-Anon Electronic Meeting,"Oct 10, 2019 - Electronic Al-Anon meetings are held on several formats including Phone, Email, Chat, Blog, Bulletin Board, Instant Messaging, Web ..."
3302,Armed Forces,Armed Forces News | FEDweek,"News and resources for US military and civilian defense personnel and retirees including force developments, benefits, discounts and more."
1004,Newsgroups and Message Boards,Newsgroup Definition - TechTerms.com,"Mar 30, 2017 - A newsgroup is an online discussion forum accessible through Usenet. Each newsgroup contains discussions about a specific topic, indicated ..."
6634,Web-based Email,Web and Email DLP | Forcepoint,Web and Email DLP helps you avoid data breaches by enabling you to discover and protect sensitive data in the cloud or on-premise. Use custom or ...


In [None]:
df_cleaned = df[["keyword", "title", "description"]].copy() #only keep these 3 columns for data preview; .copy() avoids pandas' SettingWithCopyWarning
df_cleaned["description"] = df_cleaned["description"].map(choose_best_sentence) #FIRST MAJOR CLEANING STEP
df_cleaned.sample(frac=1).groupby('keyword', sort=False).head(5)

Unnamed: 0,keyword,title,description
3404,Business,BSR: Home,We are a global nonprofit business network and consultancy dedicated to sustainability.
5708,Real Estate,Marketplace - BiggerPockets Marketplace | Real Estate ...,The BiggerPockets marketplace is the only place on BiggerPockets where you can advertise residential or commercial real estate for sale.
3473,Charitable Organizations,What is the Difference between a Nonprofit Organization and ...,How can you determine if a nonprofit is a charity?
5651,Political Organizations,Women's Participation in Violent Political Organizations ...,"Women's Participation in Violent Political Organizations - Volume 109 Issue 3 - JAKANA L. THOMAS, KANISHA D. BOND."
6415,Travel,"Travel Collection — Aer | Modern gym bags, travel backpacks and ...","The All-New Travel Collection is designed for a smarter, streamlined travel experience."
6343,Sports,"Detroit sports, Mich. sports, high school sports - Detroit News ...","The Detroit News delivers news, analysis, and scores for Michigan, Michigan State, Lions, Tigers, Red Wings, Pistons, and high school sports."
6340,Sports,Home - BBC Sport,"Breaking news & live sports coverage including results, video, audio and analysis on Football, F1, Cricket, Rugby Union, Rugby League, Golf, Tennis and all the ..."
4429,Web Analytics,Best Web Analytics Software | 2019 Reviews of the Most ...,"Free, interactive tool to quickly narrow your choices and contact multiple vendors."
2769,Weapons (Sales),The US brought in $192.3 billion from weapon sales last year ...,"WASHINGTON — Combined weapon sales from American companies for fiscal 2018 were up 13 percent over fiscal 2017 figures, netting ..."
532,Instant Messaging,Instant Messaging for Business: Your 10 Best Options (Oct 2019),Here is a list ot the best Instant Messaging Programs & Messaging Tools for Business Messaging ...


In [None]:
df_cleaned.shape

(8161, 3)

In [None]:
#this step is slow: it runs spaCy and KeyBERT on every row
df_cleaned = df_cleaned.assign(generated_keywords=df_cleaned["description"].map(get_keywords)) #SECOND MAJOR CLEANING STEP
df_cleaned.sample(frac=1).groupby('keyword', sort=False).head(5)

Unnamed: 0,keyword,title,description,generated_keywords
646,Job Search,Quick Job Search - Connecting Colorado,Workforce Centers | Unemployment Insurance Benefits | Labor Market Information ...,benefits labor insurance market information
4341,Secure Websites,The Complete Beginner's Guide to Secure Website ...,There is one less common way that not many people think about that can help reach this goal is a secure website with SSL encryption…which ...,way people secure website goal
6513,Web Chat,Web chat | SGIC Insurance,"8am–5pm, Mon-Fri.",mon fri 8am 5pm
2903,File Sharing and Storage,Document Sharing and Storage: Information Technology ...,Northwestern provides a variety of file sharing solutions for effective collaboration among members of the ...,provides variety members collaboration file
2071,Lingerie and Swimsuit,Honey Ryder Swimsuit | Honey Birdette Swimwear | Sexy ...,Honey Ryder Swimsuit | Honey Birdette Swimwear | Sexy Swimwear | Bikini Set | One Piece Swimsuit | Hot Swimwear | Luxury Swimwear.,swimwear sexy honey set ryder
1709,Alternative Beliefs,Philosophy: The Study of Alternative Beliefs: Neal W. Klausner ...,"Philosophy: The Study of Alternative Beliefs [Neal W. Klausner, Paul G. Kuntz] on Amazon.com. *",philosophy study com neal klausner
1308,Personal Websites and Blogs,20 Personal Websites to Inspire Job Seekers | Deputy®,"Check out this list of the best personal websites to get started and help you land ... Your website can be a blog, a portfolio of your work, a way to ...",work way blog websites land
156,Entertainment,Entertainment | The Epoch Times,,
3025,Internet Telephony,What Is Internet Telephony? - TeleDynamic Communications,"In my day-to-day role, the most common question for people considering a new business telephone system in the San Francisco Bay Area is, ...",people considering role bay telephone
7049,Extremist Groups,Extremist Groups - Wikipedia,,


In [None]:
#drop rows with missing values and rows whose keywords are just a 'nan' placeholder
df_cleaned = df_cleaned.dropna(how='any')
df_cleaned = df_cleaned[~df_cleaned['generated_keywords'].isin(['nan', 'NaN', "['nan']"])]
df_cleaned.sample(frac=1).groupby('keyword', sort=False).head(5)


Unnamed: 0,keyword,title,description,generated_keywords
2332,Other Adult Materials,Why Is Tumblr Banning Adult Content? Censorship Causes ...,Why is Tumblr banning adult content and what alternatives are there to ... who wanted a place to share their world with other like-minded folks.,wanted place adult alternatives banning
5443,Domain Parking,Setting up Domain Parking on cPanel - Help Center ...,Setting up domain parking is super simple on cPanel.,setting domain simple parking super
708,Meaningless Content,"This website is blocked by Fortinet: ""Category: Meaningless ...","This website is blocked by Fortinet: ""Category: Meaningless Content"".",meaningless content website category blocked
698,Meaningless Content,Has Social Media Made Content Meaningless? - Skyword,"That's because, according to the app's creator, meaningful content isn't the point anymore.",point anymore according app creator
6927,Drug Abuse,Drug Addiction in Health Care Professionals,The abuse of prescription drugs—especially controlled substances—is a serious social and health problem in the United States today.,drugs especially social today controlled
6221,Society and Lifestyles,Who's employed by the lifestyles of the rich and famous?,"... and because it presages at least one vision of the ""future of work"" at a time when society is wondering what such ""new work"" will look like.",work look new society vision
4678,Advertising,Advertising | Definition of Advertising at Dictionary.com,"Advertising definition, the act or practice of calling public attention to one's product, service, need, etc.,",service need practice advertising definition
5325,Digital Postcards,Postcards of the Midlands - Local History Digital Collections ...,"The South Carolina Postcards collection includes early 20th-century postcards and souvenir folders depicting historic images of Columbia, Richland County, ...",century postcards south includes folders
1342,Personal Websites and Blogs,8 Reasons Why You NEED a Personal Website (and Blog ...,Having a personal website and blog has so many benefits.,having personal blog benefits website
8002,Phishing,phishing - Mashable,"The latest articles about phishing from Mashable, the media and tech company.",media tech latest articles phishing


In [None]:
df_cleaned.shape

(7608, 4)

In [None]:
len(df_cleaned["generated_keywords"].iloc[0])

35

In [None]:
#keep only the X and y columns, drop rows with very short keyword strings, and save the DataFrame to a CSV file
df_cleaned_B = df_cleaned[["description", "generated_keywords"]].copy()
df_cleaned_B = df_cleaned_B[df_cleaned_B['generated_keywords'].str.len() > 15]   #keep only keyword strings longer than 15 characters
df_cleaned_B.to_csv("cleaned_data4.csv")
df_cleaned_B.shape

(7527, 2)
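
The two row filters (placeholder removal and the `> 15` character threshold) are easy to check on plain strings. The sketch below uses a hypothetical helper `keep_row` and keyword strings taken from the tables in this notebook. It mirrors the pandas logic but is not the code that produced the shapes shown above.

```python
PLACEHOLDERS = {"nan", "NaN", "['nan']"}

def keep_row(generated_keywords, min_chars=15):
    """Return True if a keyword string is usable as model input."""
    if generated_keywords is None or generated_keywords in PLACEHOLDERS:
        return False
    # Strictly greater than the threshold, as in the pandas filter.
    return len(generated_keywords) > min_chars

keyword_rows = [
    "vast growing library educators crew",  # 35 characters
    "mon fri 8am 5pm",                      # exactly 15 characters
    "nan",
    None,
]

kept_rows = [kw for kw in keyword_rows if keep_row(kw)]
print(kept_rows)  # ['vast growing library educators crew']
```

A string such as "mon fri 8am 5pm" is exactly 15 characters long and is therefore dropped, since the comparison is strict.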

In [None]:
df_cleaned_B.head(50)

Unnamed: 0,description,generated_keywords
0,A crew of experienced educators helms our vast and growing library.,vast growing library educators crew
2,"Education is the process of facilitating learning, or the acquisition of knowledge, skills, values, beliefs, and habits.",education process values beliefs facilitating
3,Education Week talked to dozens of principals about strategies for building strong partnerships with teachers and heard from many teachers about what they ...,heard teachers strong building dozens
5,"Education, discipline that is concerned with methods of teaching and learning in schools or school-like environments as opposed to various nonformal and ...",school like discipline methods environments
6,"With New York sharply divided over gifted education, some parents point to a school in Harlem as proof that selective schools can be integrated.",point school integrated selective sharply
7,Forbes is a leading source for reliable news and updated analysis on Education.,reliable news analysis education updated
8,Fierce advocates for the high academic achievement of all students — particularly those of color or living in poverty.,students particularly poverty achievement fierce
9,Stanford Graduate School of Education (GSE) is a leader in pioneering new and better ways to achieve high-quality education for all.,quality education leader pioneering stanford
10,"The California Department of Education provides leadership, assistance, oversight and resources so that every Californian has access to an education that ...",education provides department oversight california
11,"We are a nonprofit, non-partisan communications organization dedicated to building support for student-focused improvements in public education from ...",public education communications improvements non


## Applying the functions to the e-commerce (500) product dataset

In [None]:
df2 = pd.read_csv("../../final project/data/serp data/sample-data.csv")

In [None]:
df2.head()

Unnamed: 0,id,description
0,1,"Active classic boxers - There's a reason why our boxers are a cult favorite - they keep their cool, especially in sticky situations. The quick-drying, lightweight underwear takes up minimal space in a travel pack. An exposed, brushed waistband offers next-to-skin softness, five-panel constructio..."
1,2,"Active sport boxer briefs - Skinning up Glory requires enough movement without your boxers deciding to poach their own route. The form-fitting Active Sport Boxer Briefs are made from breathable 93% polyester (71% recycled) fabric that's fast-wicking, dries quickly and has 7% spandex for stretch;..."
2,3,"Active sport briefs - These superbreathable no-fly briefs are the minimalist's choice for high-octane endeavors. Made from a blend of fast-wicking, quick-drying 93% polyester (71% recycled) and 7% spandex that has both stretch-mesh (for support) and open mesh (for cooling airflow). Soft edging a..."
3,4,"Alpine guide pants - Skin in, climb ice, switch to rock, traverse a knife-edge ridge and boogie back down - these durable, weather-resistant and breathable soft-shell pants keep stride on every mountain endeavor. The midweight stretch-woven polyester won't restrict your moves, and the brushed in..."
4,5,"Alpine wind jkt - On high ridges, steep ice and anything alpine, this jacket serves as a true ""best of all worlds"" staple. It excels as a stand-alone shell for blustery rock climbs, cool-weather trail runs and high-output ski tours. And then, when conditions have you ice and alpine climbing, it ..."
5,6,"Ascensionist jkt - Our most technical soft shell for full-on mountain pursuits strikes the alpinist's balance between protection and minimalism. The dense 2-way-stretch polyester double weave, with stitchless seams, has exceptional water- and wind-resistance, a rapid dry time and superb breathab..."
6,7,"Atom - A multitasker's cloud nine, the Atom plays the part of courier bag, daypack and carry-on. Its teardrop shape provides the support of a daypack by positioning the load behind your shoulder, and the single-strap design makes getting to the goods simple - just spin it around front. The large..."
7,8,"Print banded betina btm - Our fullest coverage bottoms, the Betina fits highest across the hips with a slightly scooped, lined front. Made from a blend of 82% nylon/18% spandex.<br><br><b>Details:</b><ul> <li>Fullest coverage bottom</li> <li>""Fits highest across hips, full front and back coverag..."
8,9,"Baby micro d-luxe cardigan - Micro D-Luxe is a heavenly soft fabric with down-to-earth applications. This cardigan is made from a quick-drying, durable 4.6-oz 100% polyester (87% recycled) microdenier fleece that is lightweight and breathable so it can work as a top or midlayer. A wind flap back..."
9,10,"Baby sun bucket hat - This hat goes on when the sun rises above the horizon, and stays on when raindrops start falling. Its made from an ultra-durable 4-ply, 4.2-oz Supplex nylon fabric with a DWR (durable water repellent) finish, and reverses to either a contrasting solid color or print. A soft..."


In [None]:
def choose_best_sentence_2(text):
    #optimized for the fashion dataset
    #splits the text into spaCy sentences, keeps only the first one and strips the leading product name ("Product name - ") with a regex
    pattern = r"^([a-zA-Z]+( [a-zA-Z]+)+)-shirt\s-\s|^([a-zA-Z]+( [a-zA-Z]+)+)\s-\s|^[a-zA-Z]+\s[0-9]+\s[a-zA-Z]+\s[a-zA-Z]+\s-\s|long\s-\s|max\s-\s|reg\s-\s|^[a-zA-Z]+\s-\s|^[a-zA-Z]+\s-\s|^[a-zA-Z]+'([a-zA-Z]+( [a-zA-Z]+)+)\s-\s"
    #alternative pattern kept for reference; currently unused
    pattern2 = r"[a-zA-Z]+\s[0-9]+\s[a-zA-Z]+\st-shirt|[a-zA-Z]+\s[0-9]+\s[a-zA-Z][a-zA-Z][a-zA-Z]\s-\s|long\s-\s|reg\s-\s|[a-zA-Z]+\st-shirt\s+-\s|[\w\s]{1,3}\s-\s|Girl'|M10\s|S/|[a-zA-Z]+\s[0-9]+\st-[a-zA-Z]+|[a-zA-Z]+\s[0-9]+\sbott|Cap\s[0-9]+\s[a-zA-Z]+"
    text = str(text)
    nlp_phrases = [sent for sent in nlp(text).sents]
    
    #returns the cleaned first sentence as a str object, or "NaN" if no sentence was found
    if nlp_phrases:
        return re.sub(pattern, '', str(nlp_phrases[0]), flags=re.M)
    else:
        return "NaN"

In [None]:
test_reg = "Girl's cotton tank dress - This soft cotton dress feels breezy and cool, just like a day at the beach."
choose_best_sentence_2(test_reg)
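
To see what the prefix pattern does in isolation, it can be applied directly to first sentences taken from the product table. The sketch below skips the spaCy sentence split entirely (an assumption: each input is already a single sentence), reuses the alternatives of `pattern` with the duplicated one removed, and uses the hypothetical names `PREFIX` and `strip_product_prefix`.

```python
import re

PREFIX = re.compile(
    r"^([a-zA-Z]+( [a-zA-Z]+)+)-shirt\s-\s"
    r"|^([a-zA-Z]+( [a-zA-Z]+)+)\s-\s"
    r"|^[a-zA-Z]+\s[0-9]+\s[a-zA-Z]+\s[a-zA-Z]+\s-\s"
    r"|long\s-\s|max\s-\s|reg\s-\s"
    r"|^[a-zA-Z]+\s-\s"
    r"|^[a-zA-Z]+'([a-zA-Z]+( [a-zA-Z]+)+)\s-\s",
    flags=re.M,
)

def strip_product_prefix(sentence):
    """Remove a leading 'Product name - ' label if one of the alternatives matches."""
    return PREFIX.sub("", sentence)

boxers = strip_product_prefix(
    "Active classic boxers - There's a reason why our boxers are a cult favorite - "
    "they keep their cool, especially in sticky situations."
)
dress = strip_product_prefix(
    "Girl's cotton tank dress - This soft cotton dress feels breezy and cool, "
    "just like a day at the beach."
)
cardigan = strip_product_prefix(
    "Baby micro d-luxe cardigan - Micro D-Luxe is a heavenly soft fabric with "
    "down-to-earth applications."
)

print(boxers)
print(dress)
print(cardigan)
```

The first two product names are stripped. The third is left in place because the hyphen inside "d-luxe" is not covered by any alternative, which matches row 8 of the cleaned table further down.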

In [None]:
df2["description"] = df2["description"].map(choose_best_sentence_2)
df2["generated_keywords"] = df2["description"].map(get_keywords)
df2.reset_index(drop=True, inplace=True) 
df2.head(50)

Unnamed: 0,id,description,generated_keywords
0,1,"There's a reason why our boxers are a cult favorite - they keep their cool, especially in sticky situations.",cool especially situations sticky cult
1,2,Skinning up Glory requires enough movement without your boxers deciding to poach their own route.,glory requires movement boxers skinning
2,3,These superbreathable no-fly briefs are the minimalist's choice for high-octane endeavors.,choice high minimalist briefs octane
3,4,"Skin in, climb ice, switch to rock, traverse a knife-edge ridge and boogie back down - these durable, weather-resistant and breathable soft-shell pants keep stride on every mountain endeavor.",skin climb switch breathable endeavor
4,5,"On high ridges, steep ice and anything alpine, this jacket serves as a true ""best of all worlds"" staple.",true best ice jacket staple
5,6,Our most technical soft shell for full-on mountain pursuits strikes the alpinist's balance between protection and minimalism.,technical soft strikes pursuits minimalism
6,7,"A multitasker's cloud nine, the Atom plays the part of courier bag, daypack and carry-on.",carry plays atom daypack multitasker
7,8,"Our fullest coverage bottoms, the Betina fits highest across the hips with a slightly scooped, lined front.",highest hips betina fits coverage
8,9,Baby micro d-luxe cardigan - Micro D-Luxe is a heavenly soft fabric with down-to-earth applications.,soft fabric heavenly luxe cardigan
9,10,"This hat goes on when the sun rises above the horizon, and stays on when raindrops start falling.",goes sun stays rises hat


In [None]:
df2.dropna(how='any', inplace=True)
df2 = df2[~df2['generated_keywords'].isin(['nan', 'NaN', "['nan']"])]
df2 = df2[df2['generated_keywords'].str.len() > 15] 
df2.to_csv("cleaned_data4CLOTH.csv")

In [None]:
data1 = df_cleaned_B[["description", "generated_keywords"]]
data2 = df2[["description", "generated_keywords"]]
df_final = pd.concat([data1, data2], axis=0)                     #merge both cleaned data sets
df_final.to_csv("cleaned_data_final.csv")
df_final.shape                                                   #last expression in the cell, so the shape is displayed
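
As a final sanity check, the saved file can be read back as (X, y) pairs, with the generated keywords as input and the sentence as target. The sketch below uses only the standard library and an in-memory CSV that stands in for `cleaned_data_final.csv`. The two rows are copied from the tables above, the helper name `load_pairs` is hypothetical, and the real file additionally contains the index column written by `to_csv`.

```python
import csv
import io

# In-memory stand-in for cleaned_data_final.csv (index column omitted for brevity).
csv_text = (
    "description,generated_keywords\n"
    "Forbes is a leading source for reliable news and updated analysis on Education.,"
    "reliable news analysis education updated\n"
    '"This hat goes on when the sun rises above the horizon, and stays on when '
    'raindrops start falling.",goes sun stays rises hat\n'
)

def load_pairs(handle):
    """Return a list of (keywords, sentence) tuples, i.e. (X, y)."""
    reader = csv.DictReader(handle)
    return [(row["generated_keywords"], row["description"]) for row in reader]

pairs = load_pairs(io.StringIO(csv_text))
for keywords, sentence in pairs:
    print(f"X = {keywords!r}")
    print(f"y = {sentence!r}")
```

Because the columns are looked up by name, the same function would also work on a file that carries an extra index column.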