# Koombea blogs db Extract and transform

In this notebook I take a look at the Koombea database and select the best queries to obtain the information required for our recommender system and search engine.

In [4]:
import sys
sys.path.append('../koombea_blogs') # add koombea_blogs module
from config.settings import settings # import settings
from db.engine import engine # import engine
from tqdm import tqdm
import pandas as pd
import phpserialize as phps
from functools import lru_cache
from io import BytesIO
import many_stop_words
import bs4
import re
import spacy

conn = engine.connect() # connect to db

## What information do we need for the search system

The information that we need for building the response is:
- id
- slug
- link
- title
- post_modified
- post_date
- author
- industry
- content_type
- image_alt
- image
- content
- industry_term
- content_type_term
- lang

And the information that we need for building the search system using **NLU** techniques is:
- slug
- title
- content

## Get tables that are not empty

Some WordPress (`wp_`) tables are empty. They carry no useful information, so we skip them in the analysis.

In [2]:
@lru_cache
def get_not_empty_tables(conn, db_name):
    """ Return all the information from the non-empty tables
    
    Args:
        conn: connection object to execute mySql queries
        db_name: db_name string value
    Returns:
        tables_info: dictionary with key (String: TableName) and value (pd.DataFrame)
    """
    # Get table names
    table_names = map(lambda x: x[0], conn.execute('show tables').all())
    # Loop through all the tables, check that each one is not empty, and save the info
    tables_info = {}
    for table_name in tqdm(table_names, desc="getting tables from {}".format(db_name)):
        table_info = pd.read_sql_table(table_name, conn)
        if table_info.shape[0] != 0:
            tables_info[table_name] = table_info
    return tables_info
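
Reading every table in full only to test whether it is empty can be slow on a large database. A lighter alternative is to ask the database for a single row per table. The sketch below illustrates the idea on an in-memory SQLite database, used here only as a stand-in for MySQL. The function name `non_empty_tables` and the toy tables are hypothetical, and against MySQL the table list would come from `show tables` rather than from `sqlite_master`.

```python
import sqlite3

def non_empty_tables(conn):
    """Return the names of the tables that contain at least one row."""
    # SQLite keeps its table catalog in sqlite_master (MySQL: SHOW TABLES)
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    result = []
    for name in names:
        # LIMIT 1 stops at the first row instead of reading the whole table
        has_row = conn.execute(f'SELECT 1 FROM "{name}" LIMIT 1').fetchone()
        if has_row is not None:
            result.append(name)
    return result

# Toy database: one table with a row, one empty table
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE wp_posts (ID INTEGER)")
demo.execute("CREATE TABLE wp_comments (comment_ID INTEGER)")
demo.execute("INSERT INTO wp_posts VALUES (1)")
print(non_empty_tables(demo))
```

Only the tables that pass this check would then need to be loaded with `pd.read_sql_table`.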

In [3]:
tables = get_not_empty_tables(conn, settings.DB_NAME)

getting tables from wp_koombea20stg: 60it [00:35,  1.71it/s]


In [4]:
tables.keys()

dict_keys(['wp_aiowps_failed_logins', 'wp_aiowps_login_activity', 'wp_aiowps_login_lockdown', 'wp_options', 'wp_pmxe_google_cats', 'wp_pmxi_files', 'wp_pmxi_history', 'wp_pmxi_images', 'wp_pmxi_imports', 'wp_pmxi_posts', 'wp_pmxi_templates', 'wp_postmeta', 'wp_posts', 'wp_redirection_groups', 'wp_redirection_items', 'wp_term_relationships', 'wp_term_taxonomy', 'wp_termmeta', 'wp_terms', 'wp_usermeta', 'wp_users', 'wp_wpefc_calendarCategories', 'wp_wpefc_calendars', 'wp_wpefc_forms', 'wp_wpefc_items', 'wp_wpefc_links', 'wp_wpefc_settings', 'wp_wpefc_steps'])

The tables with information are:

|    | table_names                 |
|---:|:----------------------------|
|  0 | wp_aiowps_failed_logins     |
|  1 | wp_aiowps_login_activity    |
|  2 | wp_aiowps_login_lockdown    |
|  3 | wp_options                  |
|  4 | wp_pmxe_google_cats         |
|  5 | wp_pmxi_files               |
|  6 | wp_pmxi_history             |
|  7 | wp_pmxi_images              |
|  8 | wp_pmxi_imports             |
|  9 | wp_pmxi_posts               |
| 10 | wp_pmxi_templates           |
| 11 | wp_postmeta                 |
| 12 | wp_posts                    |
| 13 | wp_redirection_groups       |
| 14 | wp_redirection_items        |
| 15 | wp_term_relationships       |
| 16 | wp_term_taxonomy            |
| 17 | wp_termmeta                 |
| 18 | wp_terms                    |
| 19 | wp_usermeta                 |
| 20 | wp_users                    |
| 21 | wp_wpefc_calendarCategories |
| 22 | wp_wpefc_calendars          |
| 23 | wp_wpefc_forms              |
| 24 | wp_wpefc_items              |
| 25 | wp_wpefc_links              |
| 26 | wp_wpefc_settings           |
| 27 | wp_wpefc_steps              |

## Analyze `wp_posts` table

The `wp_posts` table holds useful information about the posts at `Koombea.com`.

In [5]:
# Let's see the columns
wp_posts = tables['wp_posts']
wp_posts.columns

Index(['ID', 'post_author', 'post_date', 'post_date_gmt', 'post_content',
       'post_title', 'post_excerpt', 'post_status', 'comment_status',
       'ping_status', 'post_password', 'post_name', 'to_ping', 'pinged',
       'post_modified', 'post_modified_gmt', 'post_content_filtered',
       'post_parent', 'guid', 'menu_order', 'post_type', 'post_mime_type',
       'comment_count'],
      dtype='object')

The columns of `wp_posts` table are:

|    | wp_posts_columns      |
|---:|:----------------------|
|  0 | ID                    |
|  1 | post_author           |
|  2 | post_date             |
|  3 | post_date_gmt         |
|  4 | post_content          |
|  5 | post_title            |
|  6 | post_excerpt          |
|  7 | post_status           |
|  8 | comment_status        |
|  9 | ping_status           |
| 10 | post_password         |
| 11 | post_name             |
| 12 | to_ping               |
| 13 | pinged                |
| 14 | post_modified         |
| 15 | post_modified_gmt     |
| 16 | post_content_filtered |
| 17 | post_parent           |
| 18 | guid                  |
| 19 | menu_order            |
| 20 | post_type             |
| 21 | post_mime_type        |
| 22 | comment_count         |

This table holds the blog posts, but also other post types such as industries, services, etc.

In [6]:
wp_posts['post_type'].unique()

array(['page', 'nav_menu_item', 'attachment', 'industries', 'members',
       'services', 'case_studies', 'locations', 'bdm-animations', 'post',
       'revision', 'tech_stack'], dtype=object)

## Published Blog posts

To select only the published blog posts, we filter by `post_type` (`post`) and by `post_status` (`publish`).

In [7]:
# Let's define the columns needed for our blogs dataset
blogs_columns = ['ID', 'post_author', 'post_date', 'post_content', 'post_title',
                 'post_excerpt', 'post_name', 'post_modified']
blogs = wp_posts[(wp_posts['post_type'] == 'post') & (wp_posts['post_status'] == 'publish')][blogs_columns].copy()
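
Since the goal of this notebook is to find the queries we need, it may help to see the same filter written as SQL. The sketch below runs the query on an in-memory SQLite table with three made-up rows. The toy data is hypothetical, and SQLite is only a stand-in for the MySQL database used in this notebook.

```python
import sqlite3

# SQL equivalent of the pandas filter: published rows of type 'post'
QUERY = """
SELECT ID, post_author, post_date, post_content, post_title,
       post_excerpt, post_name, post_modified
FROM wp_posts
WHERE post_type = 'post' AND post_status = 'publish'
"""

demo = sqlite3.connect(":memory:")
demo.execute("""CREATE TABLE wp_posts (
    ID INTEGER, post_author INTEGER, post_date TEXT, post_content TEXT,
    post_title TEXT, post_excerpt TEXT, post_name TEXT, post_modified TEXT,
    post_type TEXT, post_status TEXT)""")
demo.executemany(
    "INSERT INTO wp_posts VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 8, "2021-01-01", "<p>a</p>", "Published post", "", "published-post", "2021-01-01", "post", "publish"),
        (2, 8, "2021-01-02", "<p>b</p>", "Draft post", "", "draft-post", "2021-01-02", "post", "draft"),
        (3, 8, "2021-01-03", "<p>c</p>", "A page", "", "a-page", "2021-01-03", "page", "publish"),
    ],
)
rows = demo.execute(QUERY).fetchall()
print([row[0] for row in rows])  # IDs that pass both filters
```

Against the real database the same statement could presumably be passed to `pd.read_sql_query`, although that has not been run here.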

In [8]:
print('Num of blogs in db: {} are {}'.format(settings.DB_NAME, blogs.shape[0]))

Num of blogs in db: wp_koombea20stg are 677


We have some information about the blogs already, but we need to add more and convert some of the columns.

### Get `author_name` instead of `author_id`

Our `blogs` dataset has a `post_author` column, which contains the id of the author who created the post. Let's see who they are by checking another table, `wp_users` or `wp_usermeta`.

`wp_users` contains the following info:

|    | wp_users_columns    |
|---:|:--------------------|
|  0 | ID                  |
|  1 | user_login          |
|  2 | user_pass           |
|  3 | user_nicename       |
|  4 | user_email          |
|  5 | user_url            |
|  6 | user_registered     |
|  7 | user_activation_key |
|  8 | user_status         |
|  9 | display_name        |
| 10 | user_order          |


and `wp_usermeta` contains the following info:

|    | wp_usermeta_columns   |
|---:|:----------------------|
|  0 | umeta_id              |
|  1 | user_id               |
|  2 | meta_key              |
|  3 | meta_value            |

From the author we only need the `display_name` value, so no further metadata is necessary. That's why we use just the `wp_users` table and join it to our `blogs` dataset on the `post_author` id column.

In [9]:
author_ids = blogs.post_author.unique()
# sort ids
author_ids.sort()
# author display name
users = tables['wp_users'].copy()
filter_by_author_ids = users['ID'].isin(author_ids)
filtered_authors = users[filter_by_author_ids][['ID', 'display_name']]

`ID` to `display_name` mapping:

|    |   ID | display_name      |
|---:|-----:|:------------------|
|  3 |    5 | Luisa Linero      |
|  5 |    7 | Jonathan Tarud    |
|  6 |    8 | Robert Kazmi      |
|  7 |    9 | Fabian Altahona   |
|  9 |   14 | Mario Tatis       |
| 10 |   15 | Jose Gomez        |
| 11 |   16 | Alvaro Insignares |
| 13 |   19 | Guest Author      |
| 14 |   20 | Diego Coronado    |

Now we need to join the two tables on the `post_author` column of the `blogs` dataset.

#### Joining author table information to blog posts

In [10]:
# selecting users information
users = users[['ID', 'display_name']].copy()
# change columns names
users.columns = ['post_author', 'author_name']
# set post_author as index using inplace operation
users.set_index('post_author', drop=True, inplace=True)
# joining to blogs table
blogs = blogs.join(users, on='post_author')
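
The same join can also be written with `merge`, which has the advantage of being able to validate that each post matches at most one author. This is a minimal sketch on toy data: the frames `toy_blogs` and `toy_users` are hypothetical, with author names borrowed from the mapping table above.

```python
import pandas as pd

# Toy stand-ins for the blogs dataset and the wp_users table
toy_blogs = pd.DataFrame({"ID": [10, 11, 12], "post_author": [8, 16, 8]})
toy_users = pd.DataFrame({
    "ID": [8, 16, 20],
    "display_name": ["Robert Kazmi", "Alvaro Insignares", "Diego Coronado"],
})

# Rename so the key column matches the blogs dataset
authors = toy_users.rename(columns={"ID": "post_author", "display_name": "author_name"})

# Left join keeps every blog; validate raises if an author id is duplicated
joined = toy_blogs.merge(authors, on="post_author", how="left", validate="many_to_one")
print(joined)
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the users table ever contains the same id twice, instead of silently duplicating blog rows.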

In [11]:
blogs.tail(3)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name
2636,5932,8,2021-06-15 09:31:38,<!-- wp:paragraph -->\n<p>App conversion track...,App Conversion Tracking: Ultimate Guide,,app-conversion-tracking,2021-06-15 09:31:41,Robert Kazmi
2638,5934,16,2021-06-15 10:43:46,<!-- wp:paragraph -->\n<p>If your company is i...,The Fundamentals of Phygital Retail,,phygital,2021-06-15 10:43:47,Alvaro Insignares
2640,5937,8,2021-06-15 10:57:48,<!-- wp:paragraph -->\n<p>Human Augmentation i...,Human Augmentation: The Future Is Here,,human-augmentation,2021-06-15 10:57:50,Robert Kazmi


Now `post_author` holds the author id and `author_name` holds the author's display name.

### Getting Industry information for our blogs: `industry_term` and `industry`

In our blogs dataset we are still missing the industry information, that is, `industry_term` and `industry` for each blog post. This information could be located in the following tables:

- `wp_terms`: This table contains the following information:

|    | wp_terms_cols   |
|---:|:----------------|
|  0 | term_id         |
|  1 | name            |
|  2 | slug            |
|  3 | term_group      |
|  4 | term_order      |

The `name` values are the display names of the terms and the `slug` values are their URL-friendly names. This table contains many kinds of terms, not just `industry_term` but `content_type` as well. Its rows are accessed by `term_id`.

- `wp_termmeta`: This table contains the following information:

|    | wp_termmeta_cols   |
|---:|:-------------------|
|  0 | meta_id            |
|  1 | term_id            |
|  2 | meta_key           |
|  3 | meta_value         |

However the `meta_value` values of this table are not useful at all, because they are empty, so we will not use this table.

- `wp_term_taxonomy` This table contains the following information:

|    | wp_term_taxonomy   |
|---:|:-------------------|
|  0 | term_taxonomy_id   |
|  1 | term_id            |
|  2 | taxonomy           |
|  3 | description        |
|  4 | parent             |
|  5 | count              |

Each `term_id` is related to a `taxonomy`, which tells us whether the term is an industry term, a content type term, and so on. The values found in `taxonomy` are:

| taxonomy         |
|:-----------------|
| nav_menu         |
| category         |
| post_tag         |
| industry         |
| buying_stage     |
| persona          |
| service_category |

- `wp_term_relationships`: This table contains the following information:

|    | wp_term_relationships   |
|---:|:------------------------|
|  0 | object_id               |
|  1 | term_taxonomy_id        |
|  2 | term_order              |

This table relates `object_id`, which holds the `ID` values of our `blogs` dataset, to `term_taxonomy_id`, so we can use it to get the `industry_term` of each blog post.
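
To make the relationship between these tables concrete, here is a minimal sketch of the lookup chain `wp_term_relationships` → `wp_term_taxonomy` → `wp_terms` on toy data. The post id and term ids are taken from the outputs in this notebook, but the `term_taxonomy_id` values (1 and 2) are made up for illustration.

```python
import pandas as pd

# Toy versions of the three tables (term_taxonomy_id values are hypothetical)
toy_terms = pd.DataFrame({
    "term_id": [6, 328], "name": ["Blog", "Retail"], "slug": ["blog", "retail"]})
toy_taxonomy = pd.DataFrame({
    "term_taxonomy_id": [1, 2], "term_id": [6, 328],
    "taxonomy": ["category", "industry"]})
toy_relationships = pd.DataFrame({
    "object_id": [5934, 5934], "term_taxonomy_id": [1, 2]})

# object_id -> term_taxonomy_id -> term_id -> name/slug
chain = (toy_relationships
         .merge(toy_taxonomy, on="term_taxonomy_id")
         .merge(toy_terms, on="term_id"))

# Keep only the rows whose taxonomy is 'industry'
industry = chain.loc[chain["taxonomy"] == "industry", ["object_id", "name", "slug"]]
print(industry)
```

Filtering on `taxonomy` is what separates the industry of a post from its category, since both are attached to the post through the same relationship table.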

#### Industries term and name

In [12]:
term_taxonomy = tables['wp_term_taxonomy'].copy()
terms = tables['wp_terms'].copy()
term_relationships = tables['wp_term_relationships'].copy()

In [13]:
industries = terms[terms.term_id.isin(term_taxonomy[term_taxonomy.taxonomy == 'industry'].term_id)]
industries

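|     |   term_id | name                | slug                |   term_group |   term_order |
|----:|----------:|:--------------------|:--------------------|-------------:|-------------:|
|   6 |       291 | HiTech              | hi-tech             |            0 |            0 |
|  16 |       301 | All Industries      | all-industries      |            0 |            0 |
|  22 |       307 | Other               | other               |            0 |            0 |
|  24 |       309 | App Development     | app-development     |            0 |            0 |
|  37 |       322 | Koombea Culture     | koombea-culture     |            0 |            0 |
|  41 |       326 | FinTech             | fintech             |            0 |            0 |
|  43 |       328 | Retail              | retail              |            0 |            0 |
|  48 |       333 | IoT                 | iot                 |            0 |            0 |
|  50 |       335 | MedTech             | medtech             |            0 |            0 |
| 113 |       398 | Website Development | website-development |            0 |            0 |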


#### Joining `industry_term` and `industry` in blogs dataset

In [14]:
# Function: that joins information from terms to blogs
def join_terms_to_blogs(terms, term_taxonomy, term_relationships, blogs, kind):
    """ Join information from terms to blogs, selecting from kind.
    Args:
        terms (pd.DataFrame): pandas dataframe for wp_terms table
        term_taxonomy (pd.DataFrame): pandas dataframe for wp_term_taxonomy table
        term_relationships (pd.DataFrame): pandas dataframe for wp_term_relationships table
        blogs (pd.DataFrame): pandas dataframe that we are constructing for our blogs dataset
        kind (str): kind specify the type of term - industry or content_type
    Returns:
        join table from terms information to blogs dataset
    """
    # from terms we need just term_id, name and slug columns
    terms_copy = terms[['term_id', 'name', 'slug']].copy() # select needed columns
    # from term_taxonomy select just term_taxonomy_id, term_id, taxonomy
    term_taxonomy_copy = term_taxonomy[['term_taxonomy_id', 'term_id', 'taxonomy']].copy()
    # from term_relationships select just object_id, term_taxonomy_id
    term_relationships_copy = term_relationships[['object_id', 'term_taxonomy_id']].copy()
    # change first column name for term_relationships_ to match blogs ID column
    term_relationships_copy.columns = ['ID', 'term_taxonomy_id']
    # make joins kind taxonomy from term_taxonomy_selection to term_relationships_selection
    term_taxonomy_copy = term_taxonomy_copy.join(terms_copy.set_index('term_id'), on='term_id')
    # filter by kind taxonomy
    term_taxonomy_kind = term_taxonomy_copy[term_taxonomy_copy.taxonomy == kind].copy()
    # select just term_taxonomy_id, term_id, name, and slug
    term_taxonomy_kind = term_taxonomy_kind[['term_taxonomy_id', 'term_id', 'name', 'slug']]
    # change cols names
    term_taxonomy_kind.columns = ['term_taxonomy_id', f'{kind}_id', f'{kind}_name', f'{kind}_slug']
    # make joins term_taxonomy_industries_selection  to term_relationships_selection
    term_relationships_copy = term_relationships_copy.join(term_taxonomy_kind.set_index('term_taxonomy_id'), on='term_taxonomy_id')
    # Drop na values
    term_relationships_copy.dropna(inplace=True)
    # Some IDs are repeated in this table; drop the duplicates and keep the first
    term_relationships_copy.drop_duplicates(subset=['ID'], keep='first', inplace=True)
    # Drop taxonomy id columns
    term_relationships_copy.drop(columns=['term_taxonomy_id'], inplace=True)
    # joins term_relationships_ to blogs
    return blogs.join(term_relationships_copy.set_index('ID'), on='ID')

In [15]:
blogs = join_terms_to_blogs(terms, term_taxonomy, term_relationships, blogs, kind = 'industry')

In [16]:
blogs.tail(3)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug
2636,5932,8,2021-06-15 09:31:38,<!-- wp:paragraph -->\n<p>App conversion track...,App Conversion Tracking: Ultimate Guide,,app-conversion-tracking,2021-06-15 09:31:41,Robert Kazmi,309.0,App Development,app-development
2638,5934,16,2021-06-15 10:43:46,<!-- wp:paragraph -->\n<p>If your company is i...,The Fundamentals of Phygital Retail,,phygital,2021-06-15 10:43:47,Alvaro Insignares,328.0,Retail,retail
2640,5937,8,2021-06-15 10:57:48,<!-- wp:paragraph -->\n<p>Human Augmentation i...,Human Augmentation: The Future Is Here,,human-augmentation,2021-06-15 10:57:50,Robert Kazmi,335.0,MedTech,medtech


### Getting content type information for our blogs: `content_type_term` and `content_type`

To get this we reuse the same function as before, because the `industry` information and the `content_type` information are stored in the same way. Content types live under the `category` taxonomy.

#### Content types term and name

In [17]:
content_types = terms[terms.term_id.isin(term_taxonomy[term_taxonomy.taxonomy == 'category'].term_id)]
content_types

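|     |   term_id | name          | slug   |   term_group |   term_order |
|----:|----------:|:--------------|:-------|-------------:|-------------:|
|   1 |         6 | Blog          | blog   |            0 |            0 |
|  20 |       305 | Press Release | press  |            0 |            0 |
| 218 |       503 | Koombea Latam | es     |            0 |            0 |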


#### Joining `content_type_term` and `content_type` in blogs dataset

In [18]:
blogs = join_terms_to_blogs(terms, term_taxonomy, term_relationships, blogs, kind = 'category')

In [19]:
blogs.tail(3)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug,category_id,category_name,category_slug
2636,5932,8,2021-06-15 09:31:38,<!-- wp:paragraph -->\n<p>App conversion track...,App Conversion Tracking: Ultimate Guide,,app-conversion-tracking,2021-06-15 09:31:41,Robert Kazmi,309.0,App Development,app-development,6.0,Blog,blog
2638,5934,16,2021-06-15 10:43:46,<!-- wp:paragraph -->\n<p>If your company is i...,The Fundamentals of Phygital Retail,,phygital,2021-06-15 10:43:47,Alvaro Insignares,328.0,Retail,retail,6.0,Blog,blog
2640,5937,8,2021-06-15 10:57:48,<!-- wp:paragraph -->\n<p>Human Augmentation i...,Human Augmentation: The Future Is Here,,human-augmentation,2021-06-15 10:57:50,Robert Kazmi,335.0,MedTech,medtech,6.0,Blog,blog


### Getting Language of the blogs

The `content_type` values give us the language of each blog post: posts in the `es` category (Koombea Latam) are in Spanish, and we treat every other post as English.

In [20]:
blogs['lang'] = blogs['category_slug'].apply(lambda x: x if x=='es' else 'en')

In [21]:
blogs.head(3)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug,category_id,category_name,category_slug,lang
92,2869,8,2021-03-22 09:00:00,"<span style=""font-weight: 400;"">Are you curiou...",Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en
96,2873,8,2014-09-15 00:00:00,<strong>A Brief History of Scrum Agile Develop...,Is a Scrum Agile Development Process Right for...,,is-a-scrum-agile-development-process-right-for...,2014-09-15 00:00:00,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en
99,2876,8,2014-09-08 00:00:00,Guessing is a part of life—and it’s necessary....,How to Make Your App Irresistible Through User...,,how-to-make-your-app-irresistible-through-user...,2014-09-08 00:00:00,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en


### Getting image information for our blogs: `image_alt` and `image_url`

- `wp_postmeta`: this table contains the following information:

|    | wp_postmeta_cols   |
|---:|:-------------------|
|  0 | meta_id            |
|  1 | post_id            |
|  2 | meta_key           |
|  3 | meta_value         |

Some of the `meta_key` values in this table are needed to build the image information.

- `wp_options`: this table contains the following information:

|    | wp_options_cols   |
|---:|:------------------|
|  0 | option_id         |
|  1 | option_name       |
|  2 | option_value      |
|  3 | autoload          |

From the `option_name` column of `wp_options` we need `siteurl` and `autoptimize_css_exclude`. `siteurl` is the URL of the respective db environment, and `autoptimize_css_exclude` contains the middle path needed to construct the image URL. From the `meta_key` column of `wp_postmeta` we need `_thumbnail_id`, `_wp_attachment_image_alt` and `_wp_attachment_metadata`. `_thumbnail_id` gives the id of the attachment, which we then use to look up the other two: if `_wp_attachment_image_alt` is empty we use the blog title instead, and `_wp_attachment_metadata` contains the filename that forms the last part of the image URL.
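
The two-step lookup (blog → `_thumbnail_id` → attachment metadata) can be sketched on toy data as follows. The frames `toy_blogs` and `toy_postmeta` and all their values are hypothetical. Only the `meta_key` names come from the database.

```python
import pandas as pd

toy_blogs = pd.DataFrame({"ID": [1, 2], "post_title": ["First post", "Second post"]})
toy_postmeta = pd.DataFrame({
    "post_id": [1, 2, 100, 101],
    "meta_key": ["_thumbnail_id", "_thumbnail_id",
                 "_wp_attachment_image_alt", "_wp_attachment_metadata"],
    "meta_value": ["100", "101", "A descriptive alt text", "a:0:{}"],
})

# Step 1: blog ID -> attachment (thumbnail) id
is_thumb = toy_postmeta["meta_key"] == "_thumbnail_id"
thumbs = toy_postmeta.loc[is_thumb, ["post_id", "meta_value"]]
thumbs = thumbs.rename(columns={"post_id": "ID", "meta_value": "thumbnail_id"})
thumbs["thumbnail_id"] = thumbs["thumbnail_id"].astype(int)

# Step 2: attachment id -> alt text
is_alt = toy_postmeta["meta_key"] == "_wp_attachment_image_alt"
alts = toy_postmeta.loc[is_alt, ["post_id", "meta_value"]]
alts = alts.rename(columns={"post_id": "thumbnail_id", "meta_value": "image_alt"})

# Chain both steps and fall back to the title when no alt text exists
result = (toy_blogs
          .merge(thumbs, on="ID", how="left")
          .merge(alts, on="thumbnail_id", how="left"))
result["image_alt"] = result["image_alt"].fillna(result["post_title"])
print(result[["ID", "image_alt"]])
```

The second toy post has a thumbnail without alt text, so it ends up with its own title as `image_alt`.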

In [22]:
options = tables['wp_options'].copy() # get options table
postmeta = tables['wp_postmeta'].copy() # get postmeta table
# retrieve siteurl information from options
site_url = options[options.option_name == 'siteurl'].option_value.item()
# retrieve content_wp (the middle path needed to construct image_url)
content_wp = options[options.option_name == 'autoptimize_css_exclude']['option_value'].item()
content_wp = content_wp.split(',')[1].strip().strip('/')
print('First part of the url:', site_url + '/' + content_wp + '/')

First part of the url: https://koombea20stg.wpengine.com/wp-content/uploads/


We have the first part of the image URL, but we still need the filename of the image for one of the sizes `medium`, `thumbnail` or just `file`. We use `medium` when it exists, fall back to `thumbnail` if `medium` does not exist, and use the original `file` if neither size exists.
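
The fallback order can be sketched in isolation on an already deserialized metadata dictionary. The helper name `pick_image_file` and the sample filenames are hypothetical. The dictionary layout (`file` plus a `sizes` mapping) is assumed from how the notebook reads `_wp_attachment_metadata`.

```python
def pick_image_file(attachment_metadata):
    """Return the relative image path, preferring medium, then thumbnail."""
    # 'file' looks like 'YYYY/MM/name.ext'; only the last part changes per size
    file_parts = attachment_metadata["file"].split("/")
    sizes = attachment_metadata.get("sizes") or {}
    for size_name in ("medium", "thumbnail"):
        size_info = sizes.get(size_name)
        if size_info:
            file_parts[-1] = size_info["file"]
            break  # stop at the first size that exists
    return "/".join(file_parts)

both = {"file": "2021/06/cover.png",
        "sizes": {"medium": {"file": "cover-300x200.png"},
                  "thumbnail": {"file": "cover-150x150.png"}}}
thumb_only = {"file": "2021/06/cover.png",
              "sizes": {"thumbnail": {"file": "cover-150x150.png"}}}
no_sizes = {"file": "2021/06/cover.png", "sizes": {}}

print(pick_image_file(both))
print(pick_image_file(thumb_only))
print(pick_image_file(no_sizes))
```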

In [23]:
@lru_cache
def extract_image_url(site_url, content_wp, value):
    """ Extract image url from php serialize information
    Args:
        site_url (str): site url from corresponding db environment
        content_wp (str): intermediate path to images content
        value (str): value from the meta_value column of the wp_postmeta table
            containing the PHP-serialized _wp_attachment_metadata
    Returns:
        image url site_url/content_wp/image_file
    """
    # deserialize with php
    attachment_metadata = phps.loads(value.encode(), decode_strings = True)
    # get original file name
    file_name_split = attachment_metadata['file'].split('/') # split in case we need to change the last index
                                                    # (name for the specific size if there exist)
    files_size_info = attachment_metadata.get('sizes')  # .get avoids a KeyError if 'sizes' is absent
    if files_size_info:
        medium_file_info = files_size_info.get('medium')
        thumb_file_info = files_size_info.get('thumbnail')
        if medium_file_info:
            file_name_split[-1] = medium_file_info['file']
        elif thumb_file_info:
            file_name_split[-1] = thumb_file_info['file']
    return site_url + '/' + content_wp + '/' + '/'.join(file_name_split)

In [24]:
def get_blog2thumbnail_id_map(blogs, postmeta):
    """ Get pandas dataframe mapping of blog id to thumbnail id
    Args:
        blogs (pd.DataFrame): blogs dataset we are building
        postmeta (pd.DataFrame): postmeta pandas dataframe for wp_postmeta table
    Returns:
        pandas dataframe mapping ID (blog id columns in blogs) to thumbnail_id
    """
    # Filter by blogs ID and meta key _thumbnail_id
    blog2thumbnail = postmeta[(postmeta.post_id.isin(blogs.ID))  & (postmeta.meta_key == '_thumbnail_id')].copy()
    # Get just the needed columns
    blog2thumbnail = blog2thumbnail[['post_id', 'meta_value']]
    # cast _thumbnail_id meta_value to int64
    blog2thumbnail.meta_value = blog2thumbnail.meta_value.astype('int')
    # change columns names to the corresponding
    blog2thumbnail.columns = ['ID', 'thumbnail_id']
    return blog2thumbnail

In [25]:
def get_postmeta_image(blogs, postmeta, kind):
    """ Get postmeta kind data from postmeta
    Args:
        blogs (pd.DataFrame): blogs dataset we are building
        postmeta (pd.DataFrame): postmeta pandas dataframe for wp_postmeta table
        kind (str): string that must be either 'image_alt' or 'metadata'
    Returns:
        pandas dataframe containing ID (blog id columns in blogs) to postmeta kind data
    """
    # get auxiliary mapping from blog ID to thumbnail_id
    blog2thumbnail = get_blog2thumbnail_id_map(blogs, postmeta)
    # get kind of metakey
    metakey_df = postmeta[(postmeta.post_id.isin(blog2thumbnail.thumbnail_id)) &
                          (postmeta.meta_key == '_wp_attachment_{}'.format(kind))].copy()
    # select the necessary columns
    metakey_df = metakey_df[['post_id', 'meta_value']]
    # change the columns
    col_name = kind if kind == 'image_alt' else 'image'
    metakey_df.columns = ['thumbnail_id', col_name]
    # join the meta key dataframe containing the meta value to the auxiliary mapping
    return blog2thumbnail.join(metakey_df.set_index('thumbnail_id'),
                               on='thumbnail_id').drop(columns = 'thumbnail_id')

In [26]:
def join_image_alt_to_blogs(blogs, postmeta):
    """ Join image alt to blogs dataframe on 'ID' (blog_id)
    Args:
        blogs (pd.DataFrame): blogs dataset we are building
        postmeta (pd.DataFrame): postmeta pandas dataframe for wp_postmeta table
    Returns:
        blogs dataframe with joining information from image alt info
    """
    # get image alt dataframe
    image_alt_df = get_postmeta_image(blogs, postmeta, kind='image_alt')
    # joins to blogs_copy
    blogs_copy = blogs.copy()
    blogs_copy = blogs_copy.join(image_alt_df.set_index('ID'), on='ID')
    # filling NA values from post title attribute
    blogs_copy['image_alt'] = blogs_copy['image_alt'].fillna(blogs_copy['post_title'])
    return blogs_copy

In [27]:
def join_image_url_to_blogs(blogs, postmeta, site_url, content_wp):
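    """ Join image url to blogs dataframe on 'ID' (blog_id)
    Args:
        blogs (pd.DataFrame): blogs dataset we are building
        postmeta (pd.DataFrame): postmeta pandas dataframe for wp_postmeta table
        site_url (str): site url from wp_options
        content_wp (str): intermediate path to images content
    Returns:
        blogs dataframe with the image url joined in the 'image' column
    """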
    # get image dataframe
    image_df = get_postmeta_image(blogs, postmeta, kind='metadata')
    # extract image url
    image_df.image = image_df.image.apply(lambda val: extract_image_url(site_url, content_wp, val))
    # joins to blogs
    return blogs.join(image_df.set_index('ID'), on='ID')

#### Joining `image_alt` to blogs

In [28]:
blogs = join_image_alt_to_blogs(blogs, postmeta)
blogs.head(2)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug,category_id,category_name,category_slug,lang,image_alt
92,2869,8,2021-03-22 09:00:00,"<span style=""font-weight: 400;"">Are you curiou...",Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Scalable Applications: Curious Why Scalability...
96,2873,8,2014-09-15 00:00:00,<strong>A Brief History of Scrum Agile Develop...,Is a Scrum Agile Development Process Right for...,,is-a-scrum-agile-development-process-right-for...,2014-09-15 00:00:00,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Is a Scrum Agile Development Process Right for...


#### Joining `image` (image_url) to blogs

In [29]:
blogs = join_image_url_to_blogs(blogs, postmeta, site_url, content_wp)
blogs.head(2)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug,category_id,category_name,category_slug,lang,image_alt,image
92,2869,8,2021-03-22 09:00:00,"<span style=""font-weight: 400;"">Are you curiou...",Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Scalable Applications: Curious Why Scalability...,https://koombea20stg.wpengine.com/wp-content/u...
96,2873,8,2014-09-15 00:00:00,<strong>A Brief History of Scrum Agile Develop...,Is a Scrum Agile Development Process Right for...,,is-a-scrum-agile-development-process-right-for...,2014-09-15 00:00:00,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Is a Scrum Agile Development Process Right for...,https://koombea20stg.wpengine.com/wp-content/u...


### Getting post url information for our blogs: `post_url`

We need some information about how to construct `post_url`. This information is located in the following tables:

- `blogs`: blogs dataset table, from here we are going to need the `category_slug` and the `post_name`
- `wp_options`: this table contains the following information:

|    | wp_options_cols   |
|---:|:------------------|
|  0 | option_id         |
|  1 | option_name       |
|  2 | option_value      |
|  3 | autoload          |

In the `option_name` column we need two options: `siteurl` and `permalink_structure`. `siteurl` is the URL of the respective db environment, and `permalink_structure` tells us how to construct the URL path. Its value, printed by the following code, is the reason we need `category_slug` and `post_name`:

``` python
print(options[options.option_name == 'permalink_structure'].option_value.item())
```

The code above, where `options` is the pandas dataframe holding the `wp_options` table, prints the following result:

`/%category%/%postname%/`

So our `post_url` will be formed as:

``` python
post_url = f'{site_url}/{category}/{postname}/'
``` 

In [30]:
# concatenate site_url with category_slug and post_name attribute of each blog
blogs['post_url'] = site_url + '/' + blogs.category_slug + '/' + blogs.post_name + '/'
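
The cell above hard-codes the `/%category%/%postname%/` layout. If the permalink structure were ever to change, the URL could instead be built from the stored template. This is a minimal sketch under that assumption: the helper `build_post_url` is hypothetical and only handles the two placeholders seen in this database.

```python
def build_post_url(site_url, permalink_structure, category, postname):
    """Fill the %category% and %postname% placeholders of a permalink template."""
    path = (permalink_structure
            .replace("%category%", category)
            .replace("%postname%", postname))
    # The template already starts with '/', so avoid a double slash
    return site_url.rstrip("/") + path

url = build_post_url(
    "https://koombea20stg.wpengine.com",
    "/%category%/%postname%/",
    "blog",
    "phygital",
)
print(url)
```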

### Complete extracted Blogs dataset

This dataset contains pretty much everything we need, as stated at the beginning of this notebook.

In [31]:
blogs.head(3)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug,category_id,category_name,category_slug,lang,image_alt,image,post_url
92,2869,8,2021-03-22 09:00:00,"<span style=""font-weight: 400;"">Are you curiou...",Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Scalable Applications: Curious Why Scalability...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/why-sca...
96,2873,8,2014-09-15 00:00:00,<strong>A Brief History of Scrum Agile Develop...,Is a Scrum Agile Development Process Right for...,,is-a-scrum-agile-development-process-right-for...,2014-09-15 00:00:00,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Is a Scrum Agile Development Process Right for...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/is-a-sc...
99,2876,8,2014-09-08 00:00:00,Guessing is a part of life—and it’s necessary....,How to Make Your App Irresistible Through User...,,how-to-make-your-app-irresistible-through-user...,2014-09-08 00:00:00,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,How to Make Your App Irresistible Through User...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/how-to-...


This completes the extraction and loading of the full dataset with the information we need. Next, we need to process the dataset.

### Save blogs information

In [34]:
# !mkdir ../data

In [36]:
blogs.to_csv(f'../data/{settings.DB_NAME}_blogs.csv', index=False)

## Preprocessing information and clean for model training

In [29]:
# load data
blogs = pd.read_csv(f"../data/{settings.DB_NAME}_blogs.csv")
blogs.head(2)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug,category_id,category_name,category_slug,lang,image_alt,image,post_url
0,2869,8,2021-03-22 09:00:00,"<span style=""font-weight: 400;"">Are you curiou...",Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Scalable Applications: Curious Why Scalability...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/why-sca...
1,2873,8,2014-09-15 00:00:00,<strong>A Brief History of Scrum Agile Develop...,Is a Scrum Agile Development Process Right for...,,is-a-scrum-agile-development-process-right-for...,2014-09-15 00:00:00,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Is a Scrum Agile Development Process Right for...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/is-a-sc...


### Check `NA` values
The NA values should be in the `post_excerpt` column only, and we will fill these values with an empty string.

In [30]:
# Check NA, the NA values must be in post excerpt only
for col in blogs.columns:
    if blogs[col].isna().any():
        print(f"{col} column contains na values")

post_excerpt column contains na values


In [31]:
blogs["post_excerpt"] = blogs["post_excerpt"].fillna("")
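
One likely reason `post_excerpt` shows up with NA values is that `read_csv` parses empty fields as NaN by default, so excerpts that were empty before saving come back as missing. The sketch below reproduces the check and the fill on a toy frame. The data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "post_title": ["A", "B"],
    "post_excerpt": [None, "short summary"],
})

# Count NA values per column and keep the names of the affected columns
na_counts = toy.isna().sum()
cols_with_na = na_counts[na_counts > 0].index.tolist()
print(cols_with_na)

# Assigning the result back avoids relying on an in-place fill of a column view
toy["post_excerpt"] = toy["post_excerpt"].fillna("")
print(toy["post_excerpt"].tolist())
```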

In [32]:
blogs.head(2)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug,category_id,category_name,category_slug,lang,image_alt,image,post_url
0,2869,8,2021-03-22 09:00:00,"<span style=""font-weight: 400;"">Are you curiou...",Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Scalable Applications: Curious Why Scalability...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/why-sca...
1,2873,8,2014-09-15 00:00:00,<strong>A Brief History of Scrum Agile Develop...,Is a Scrum Agile Development Process Right for...,,is-a-scrum-agile-development-process-right-for...,2014-09-15 00:00:00,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Is a Scrum Agile Development Process Right for...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/is-a-sc...


Expected output: `post_excerpt column contains na values`

### Getting data for training

- parse html blog content
- tokenize
- clean stopwords
- clean punctuation

#### Preview blogs content
The `post_content` column holds each blog's content, but as raw HTML rather than clean text, so we need to parse it. For that we are going to use the BeautifulSoup Python package.

In [33]:
example = blogs[blogs["lang"] == "es"].post_content.iloc[0]
print(example[:int(len(example)*0.2)])

<span style="font-weight: 400;">Tener una tienda online solía ser un lujo para las empresas. En cuestión de semanas, debido a las medidas de distanciamiento físico, todas las reglas del mundo de los negocios han cambiado. Las soluciones de eCommerce pasaron a ser de repente una necesidad. Mientras las medidas de cuarentena siguen en pie, muchas empresas están luchando para adaptarse.</span>

<span style="font-weight: 400;">Algunos negocios esperan que todo vuelva eventualmente a la normalidad. Sin embargo, esto pareciera ser altamente improbable por donde se mire. Las empresas necesitan tener una solución de ventas en línea para estos tiempos caóticos que vivimos.&nbsp;</span>

<span style="font-weight: 400;">En Koombea creemos que </span><a href="https://es.shopify.com/"><span style="font-weight: 400;">Shopify</span></a><span style="font-weight: 400;"> ofrece la mejor solución para ayudar a los negocios de todos los tamaños a adaptarse a esta nueva normalidad. Este artículo analiza al

#### Parse content blog `HTML -> string`

In [34]:
def clean_html(html_text):
    """ Parse HTML text and return the cleaned text without HTML tags.
    Args:
        html_text (str): String with HTML tags.
    Returns:
        Parsed string, without HTML tags.
    """
    html_text = ">\n".join(html_text.split(">")) # insert a line break after the end of every tag
    soup = bs4.BeautifulSoup(html_text, 'html.parser')
    return soup.get_text()
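
To see why a separator is inserted after each tag before extracting the text, here is a dependency-free sketch. It uses Python's built-in `html.parser` as a stand-in for BeautifulSoup; the `TextExtractor` class, the `strip_tags` helper and the snippet are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called once for every run of text between two tags
        self.chunks.append(data)


def strip_tags(html_text, separator=""):
    """Return the text content of html_text, joined with `separator`."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return separator.join(parser.chunks)


snippet = "<h2>Title</h2><p>First paragraph.</p><p>Second paragraph.</p>"

glued = strip_tags(snippet)                   # adjacent blocks run together
spaced = strip_tags(snippet, separator="\n")  # blocks stay separated

print(repr(glued))
print(repr(spaced))
```

Without a separator, the last word of one block is glued to the first word of the next, which would later produce bogus tokens. That is the problem the line break after every `>` avoids.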

In [35]:
print("\033[96mBefore cleaning html:\033[0m ")
print(example[:int(len(example)*0.2)], end = "\n\n")
print("\033[96mAfter cleaning html:\033[0m ")
example = clean_html(example)
print(example[:int(len(example)*0.2)])

Before cleaning html:
<span style="font-weight: 400;">Tener una tienda online solía ser un lujo para las empresas. En cuestión de semanas, debido a las medidas de distanciamiento físico, todas las reglas del mundo de los negocios han cambiado. Las soluciones de eCommerce pasaron a ser de repente una necesidad. Mientras las medidas de cuarentena siguen en pie, muchas empresas están luchando para adaptarse.</span>

<span style="font-weight: 400;">Algunos negocios esperan que todo vuelva eventualmente a la normalidad. Sin embargo, esto pareciera ser altamente improbable por donde se mire. Las empresas necesitan tener una solución de ventas en línea para estos tiempos caóticos que vivimos.&nbsp;</span>

<span style="font-weight: 400;">En Koombea creemos que </span><a href="https://es.shopify.com/"><span style="font-weight: 400;">Shopify</span></a><span style="font-weight: 400;"> ofrece la mejor solución para ayudar a los negocios de todos los tamaños a adaptarse a esta nueva norm

Apply the HTML-cleaning function to every element of the `post_content` column.

In [36]:
blogs["post_content_html"] = blogs["post_content"].copy()
blogs["post_content"] = blogs["post_content"].apply(clean_html)

In [37]:
blogs.head(1)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,industry_name,industry_slug,category_id,category_name,category_slug,lang,image_alt,image,post_url,post_content_html
0,2869,8,2021-03-22 09:00:00,\nAre you curious why scalable applications ar...,Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,HiTech,hi-tech,6.0,Blog,blog,en,Scalable Applications: Curious Why Scalability...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/why-sca...,"<span style=""font-weight: 400;"">Are you curiou..."


#### Joining text data
We are going to join the following text fields to get more information about each blog:

- post_content (clean version without html tags)
- post_excerpt
- post_name
- post_title
- industry_name

In [38]:
blogs["data"] = blogs["post_content"] + " " + blogs["post_excerpt"] + \
                " " + blogs["post_name"].apply(lambda val: " ".join(val.split("-"))) + \
                " " + blogs["post_title"] + " " + blogs["industry_name"]

In [39]:
blogs.head(1)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,...,industry_slug,category_id,category_name,category_slug,lang,image_alt,image,post_url,post_content_html,data
0,2869,8,2021-03-22 09:00:00,\nAre you curious why scalable applications ar...,Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,...,hi-tech,6.0,Blog,blog,en,Scalable Applications: Curious Why Scalability...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/why-sca...,"<span style=""font-weight: 400;"">Are you curiou...",\nAre you curious why scalable applications ar...
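
One thing to keep in mind when concatenating string columns with `+`: a missing value in any of the columns makes the whole result missing for that row. The sketch below uses made-up data (the `toy` frame is hypothetical) to show the effect and one possible guard with `fillna`.

```python
import pandas as pd

# Toy data (hypothetical): the second row has no industry name
toy = pd.DataFrame({
    "post_title": ["Scalable apps", "Scrum basics"],
    "industry_name": ["HiTech", float("nan")],
})

# Naive concatenation: the missing value propagates to the joined text
naive = toy["post_title"] + " " + toy["industry_name"]

# Guarded concatenation: fill missing values with an empty string first
safe = toy["post_title"] + " " + toy["industry_name"].fillna("")

print(naive.isna().tolist())
print(safe.tolist())
```

In this notebook the NA check only reported `post_excerpt`, which is filled before joining, so the guard matters mainly if the extraction query changes later.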


#### Preprocess data

Now that we have our column with the data needed, we are going to clean it up with the following operations:

- remove non-readable Unicode characters
- remove extra whitespace
- remove stopwords
- remove punctuation
- remove numbers
- remove special tokens
- lemmatize
- normalize to lowercase

In [40]:
stopwords = many_stop_words.get_stop_words('en').union(many_stop_words.get_stop_words('es')) # union of English and Spanish stop words

In [41]:
def get_regex_expression():
    """ Return the compiled regex patterns used for basic text normalization.
    Returns:
        List of (compiled pattern, replacement string) tuples, in the order
        in which they are applied.
    """
    # Basic normalization. These run after NON_ALPHANUMERIC_REGEX below, which
    # already replaces punctuation with spaces, so in practice only r'\s+'
    # still finds something to match.
    _patterns_ = [r'\'',
             r'\"',
             r'\.',
             r'<br \/>',
             r',',
             r'\(',
             r'\)',
             r'\!',
             r'\?',
             r'\;',
             r'\:',
             r'\s+']

    _replacements_ = [' \'  ',
                     ' ',
                     ' . ',
                     ' ',
                     ' , ',
                     ' ( ',
                     ' ) ',
                     ' ! ',
                     ' ? ',
                     ' ',
                     ' ',
                     ' ']

    # Match non alphanumeric characters
    NON_ALPHANUMERIC_REGEX = r'[^a-zA-Z0-9À-ÿ\u00f1\u00d1\s]'
    # Match numerical characters
    NUMERICAL_REGEX = r'[0-9]+'
    # Match any link or url in the text, up to the next whitespace
    # (the previous pattern needed a trailing newline and removed the rest of the line)
    LINKS_REGEX = r'https?:\/\/\S+'
    # Match hashtags
    HASHTAGS_REGEX = r'\#[^\s]*'
    # Match twitter accounts
    TWITTER_ACCOUNTS_REGEX = r'\@[^\s]*'
    # Match the literal string "author" (e.g. an "Author:" byline)
    AUTHOR_REGEX = r'author'
    # Match email
    EMAIL_REGEX = r"\S*@\S+"
    # Group regex. Emails go before twitter accounts; otherwise the twitter
    # pattern strips "@domain" and leaves the local part of the address behind.
    patterns = [
        LINKS_REGEX,
        HASHTAGS_REGEX,
        EMAIL_REGEX,
        TWITTER_ACCOUNTS_REGEX,
        AUTHOR_REGEX,
        NON_ALPHANUMERIC_REGEX,
        NUMERICAL_REGEX
    ]
    replacements = [' ']*(len(patterns))
    patterns += _patterns_
    replacements += _replacements_
    patterns_dict = list((re.compile(p), r) for p, r in zip(patterns, replacements))
    return patterns_dict

patterns_dict = get_regex_expression()

In [42]:
def remove_unnecesary_whitespaces(text):
    """ Collapse every run of whitespace (spaces, tabs, line breaks) into a single space.
    Args:
        text (str): String text
    Returns:
        Text with leading, trailing and repeated whitespace removed.
    """
    return " ".join(text.strip().split())
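
A quick sketch of what this helper does with the kinds of whitespace that show up after HTML parsing, including the non-breaking space (`\xa0`) that `&nbsp;` decodes to. The function name `collapse_whitespace` and the sample string are made up for the example.

```python
def collapse_whitespace(text):
    """Same idea as the helper above: split on any whitespace, rejoin with one space."""
    return " ".join(text.strip().split())


# Line breaks, a tab, doubled spaces and a non-breaking space (\xa0)
sample = "\n  vivimos.\xa0\n\nEn Koombea\tcreemos  "
cleaned = collapse_whitespace(sample)
print(repr(cleaned))
```

`str.split()` without arguments treats every Unicode whitespace character as a separator, so no extra regex is needed for this step.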

In [43]:
def basic_normalization(text, patterns_dict):
    """ Perform basic normalization on text.
    Args:
        text (str): String text.
        patterns_dict (List[Tuple[re.Pattern, str]]): compiled patterns paired
            with their replacement strings, applied in order.
    Returns:
        Normalized text.
    """
    text = text.lower()
    for pattern_re, replace_str in patterns_dict:
        text = pattern_re.sub(replace_str, text)
    return text
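
Because the patterns are applied one after another, their order affects the result. The sketch below illustrates this with two simplified patterns; the names `EMAIL`, `HANDLE` and `apply_patterns`, as well as the sample sentence, are hypothetical and only meant to show the mechanism.

```python
import re

# Simplified patterns for the illustration
EMAIL = re.compile(r"\S*@\S+")   # e-mail addresses (and anything else containing "@")
HANDLE = re.compile(r"@\S*")     # "@" followed by non-space characters


def apply_patterns(text, patterns):
    """Replace every match of each pattern with a space, in the given order."""
    for pattern in patterns:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


text = "write to info@example.com or follow @koombea"

# Handle pattern first: it eats "@example.com" and leaves "info" behind
handle_first = apply_patterns(text, [HANDLE, EMAIL])
# E-mail pattern first: the whole address is removed
email_first = apply_patterns(text, [EMAIL, HANDLE])

print(handle_first)
print(email_first)
```

The same reasoning applies to any pattern that removes characters another pattern relies on, so it is worth testing the full list on a few representative strings.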

Apply basic normalization.

In [44]:
example = blogs[blogs["lang"] == "es"].data.iloc[0]
print("\033[96mBefore normalize:\033[0m ")
print(example[:int(len(example)*0.2)])
example = basic_normalization(example, patterns_dict)
print("\033[96mAfter normalize:\033[0m ")
print(example[:int(len(example)*0.2)])

Before normalize:

Tener una tienda online solía ser un lujo para las empresas. En cuestión de semanas, debido a las medidas de distanciamiento físico, todas las reglas del mundo de los negocios han cambiado. Las soluciones de eCommerce pasaron a ser de repente una necesidad. Mientras las medidas de cuarentena siguen en pie, muchas empresas están luchando para adaptarse.

Algunos negocios esperan que todo vuelva eventualmente a la normalidad. Sin embargo, esto pareciera ser altamente improbable por donde se mire. Las empresas necesitan tener una solución de ventas en línea para estos tiempos caóticos que vivimos. 

En Koombea creemos que 


Shopify


 ofrece la mejor solución para ayudar a los negocios de todos los tamaños a adaptarse a esta nueva normalidad. Este artículo analiza algunos de los elementos básicos a tener en cuenta para usar Shopify como plataforma de eCommerce y sacarle un mejor provecho. La transformación digital de tu negocio no tiene que ser un problema. 


Let's clean whitespaces.

In [45]:
example = remove_unnecesary_whitespaces(example)
print("\033[96mAfter removing extra white spaces:\033[0m ")
print(example[:int(len(example)*0.2)])

After removing extra white spaces:
tener una tienda online solía ser un lujo para las empresas en cuestión de semanas debido a las medidas de distanciamiento físico todas las reglas del mundo de los negocios han cambiado las soluciones de ecommerce pasaron a ser de repente una necesidad mientras las medidas de cuarentena siguen en pie muchas empresas están luchando para adaptarse algunos negocios esperan que todo vuelva eventualmente a la normalidad sin embargo esto pareciera ser altamente improbable por donde se mire las empresas necesitan tener una solución de ventas en línea para estos tiempos caóticos que vivimos en koombea creemos que shopify ofrece la mejor solución para ayudar a los negocios de todos los tamaños a adaptarse a esta nueva normalidad este artículo analiza algunos de los elementos básicos a tener en cuenta para usar shopify como plataforma de ecommerce y sacarle un mejor provecho la transformación digital de tu negocio no tiene que ser un problema configur

Now we need to tokenize our texts. We are going to use the `nlp` object from spaCy, and while tokenizing we will also lemmatize and remove stopwords.

In [46]:
nlp_es = spacy.load("es_core_news_sm") # load spanish tokenizer
nlp_en = spacy.load("en_core_web_sm") # load english tokenizer

In [47]:
def tokenize(text, lang, lemmatize=True, remove_stops=True):
    """ Tokenize a text that has already gone through basic normalization,
    optionally lemmatizing and removing stopwords.
    Args:
        text (str): Normalized text.
        lang (str): language, either 'en' or 'es'; any other value falls back to Spanish.
        lemmatize (bool): Whether to lemmatize or not.
        remove_stops (bool): Whether to remove stopwords or not.
    Returns:
        List[str] of tokens
    """
    # choose the right tokenizer
    nlp = nlp_en if lang == 'en' else nlp_es
    # tokenize
    tokens = nlp(text)
    # drop any remaining punctuation and number-like tokens, just in case
    tokens = [token for token in tokens if not (token.is_punct or token.like_num)]
    if remove_stops:
        # remove stops
        tokens = [token for token in tokens if not (token.is_stop or token.text in stopwords)]
    return [token.lemma_.strip() for token in tokens] if lemmatize else [token.text.strip() for token in tokens]
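
The filtering logic can be sketched without spaCy. The example below is only a simplified stand-in: it splits on whitespace instead of using a real tokenizer, does not lemmatize, and uses tiny, made-up stop-word sets (`STOPS_EN`, `STOPS_ES`) in place of the lists from `many_stop_words` and spaCy. The function name `simple_tokenize` is hypothetical.

```python
# Tiny, hypothetical stop-word lists standing in for the real ones
STOPS_EN = {"the", "a", "of", "to"}
STOPS_ES = {"el", "la", "de", "una", "para"}
STOPWORDS = STOPS_EN | STOPS_ES  # union of both languages


def simple_tokenize(text, remove_stops=True):
    """Whitespace tokenization with number and stop-word filtering."""
    tokens = text.split()
    # drop tokens made only of digits
    tokens = [token for token in tokens if not token.isdigit()]
    if remove_stops:
        tokens = [token for token in tokens if token not in STOPWORDS]
    return tokens


tokens = simple_tokenize("tener una tienda online para la empresa 2021")
print(tokens)
```

Keeping the stop words in a `set` makes each membership test a constant-time lookup, which matters when the filter runs over every token of every post.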

In [48]:
print("\033[96mBefore tokenizing:\033[0m ")
print(example[:int(len(example)*0.2)])
print("\033[96mAfter tokenizing:\033[0m ")
print(tokenize(example[:int(len(example)*0.2)], "es"))

Before tokenizing:
tener una tienda online solía ser un lujo para las empresas en cuestión de semanas debido a las medidas de distanciamiento físico todas las reglas del mundo de los negocios han cambiado las soluciones de ecommerce pasaron a ser de repente una necesidad mientras las medidas de cuarentena siguen en pie muchas empresas están luchando para adaptarse algunos negocios esperan que todo vuelva eventualmente a la normalidad sin embargo esto pareciera ser altamente improbable por donde se mire las empresas necesitan tener una solución de ventas en línea para estos tiempos caóticos que vivimos en koombea creemos que shopify ofrece la mejor solución para ayudar a los negocios de todos los tamaños a adaptarse a esta nueva normalidad este artículo analiza algunos de los elementos básicos a tener en cuenta para usar shopify como plataforma de ecommerce y sacarle un mejor provecho la transformación digital de tu negocio no tiene que ser un problema configurando shopify la 

Finally, a function that wraps up the whole pipeline.

In [49]:
def process_data(text, lang, lemmatize, remove_stops, patterns_dict, normalize=True):
    """Process all the raw data
    Args:
        text (str): raw text data (already stripped of HTML tags)
        lang (str): language, either 'en' or 'es'
        lemmatize (bool): Whether to lemmatize or not.
        remove_stops (bool): Whether to remove stopwords or not.
        patterns_dict (List[Tuple[re.Pattern, str]]): compiled patterns paired
            with their replacement strings, applied in order.
        normalize (bool): Whether to apply basic text normalization or not.
    Returns:
        List[str] of tokens
    """
    text = remove_unnecesary_whitespaces(basic_normalization(text, patterns_dict)) if normalize else text
    return tokenize(text, lang, lemmatize, remove_stops)

In [50]:
def pandas_process_data_wrap(raw_data, lemmatize, remove_stops, patterns_dict, normalize):
    """ Row-wise wrapper for DataFrame.apply: raw_data is a row with `data` and `lang` fields. """
    return process_data(raw_data.data, raw_data.lang, lemmatize=lemmatize,
                        remove_stops=remove_stops, patterns_dict=patterns_dict, normalize=normalize)

In [51]:
tqdm.pandas(desc="Preprocessing data") # set pandas tqdm

In [52]:
blogs["tokenized_data"] = blogs[["data", "lang"]].progress_apply(pandas_process_data_wrap, axis=1,
                                                                 args=(True, True, patterns_dict, True))

Preprocessing data: 100%|██████████| 677/677 [01:10<00:00,  9.54it/s]


### Preview blogs dataset

In [53]:
blogs.head(2)

Unnamed: 0,ID,post_author,post_date,post_content,post_title,post_excerpt,post_name,post_modified,author_name,industry_id,...,category_id,category_name,category_slug,lang,image_alt,image,post_url,post_content_html,data,tokenized_data
0,2869,8,2021-03-22 09:00:00,\nAre you curious why scalable applications ar...,Scalable Applications: Curious Why Scalability...,,why-scalability-matters-for-your-app,2021-03-22 08:56:08,Robert Kazmi,291.0,...,6.0,Blog,blog,en,Scalable Applications: Curious Why Scalability...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/why-sca...,"<span style=""font-weight: 400;"">Are you curiou...",\nAre you curious why scalable applications ar...,"[curious, scalable, application, popular, ve, ..."
1,2873,8,2014-09-15 00:00:00,\nA Brief History of Scrum Agile Development: ...,Is a Scrum Agile Development Process Right for...,,is-a-scrum-agile-development-process-right-for...,2014-09-15 00:00:00,Robert Kazmi,291.0,...,6.0,Blog,blog,en,Is a Scrum Agile Development Process Right for...,https://koombea20stg.wpengine.com/wp-content/u...,https://koombea20stg.wpengine.com/blog/is-a-sc...,<strong>A Brief History of Scrum Agile Develop...,\nA Brief History of Scrum Agile Development: ...,"[history, scrum, agile, development, year, hir..."


### Save preprocessed and cleaned data

In [54]:
blogs.to_csv(f"../data/{settings.DB_NAME}_processed_blogs.csv", index=False)
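
Note that a column of Python lists, such as `tokenized_data`, is written to CSV as its string representation. The sketch below, on made-up data, shows one possible way to get the lists back when reading the file, using `ast.literal_eval` as a converter. The `toy` frame and the in-memory buffer are assumptions for the example; the notebook itself writes to a file on disk.

```python
import ast
import io

import pandas as pd

# Toy data (hypothetical) with a list-valued column
toy = pd.DataFrame({
    "ID": [1, 2],
    "tokenized_data": [["scalable", "application"], ["scrum", "agile"]],
})

# Write to an in-memory CSV instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Plain read: the lists come back as strings
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw["tokenized_data"].iloc[0]))

# Read with a converter: each cell is parsed back into a Python list
restored = pd.read_csv(
    io.StringIO(csv_text),
    converters={"tokenized_data": ast.literal_eval},
)
print(restored["tokenized_data"].iloc[0])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of round trip.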