## Dusting off an Old Blog

About ten years ago, on my first trip to India, I kept a blog for the first several months. After that, I suspect I got far too busy to keep it up. I had forgotten about it until I started writing again a few weeks ago, and I thought it might be fun to try to find it. As it turned out, all I had left of it was a SQL dump of the WordPress database I had back then. Extracting and formatting the posts manually would be a pain, so I thought I'd use this as an opportunity to apply some recent learning (SQL, some basic formatting functions, Markdown, etc.) and have the code do the work for me. Notes on that are below. This took the better part of a Sunday afternoon. A lot of that was actually just getting MySQL to work, but I was still reminded that knowing how to do something and having the fluency that comes from having mastered it are two totally different things.

### A. Setup

In [27]:
import pymysql
import pandas as pd
import html2text
import string

%matplotlib inline

In [4]:
conn = pymysql.connect(host='localhost',
                       user='root',
                       password='gremlins',
                       db='delhiblog')

def run_query(q):
    return pd.read_sql(q, conn)


In [9]:
q = 'select id, post_date, post_title, post_content from wp_posts;'

df = run_query(q)

df

Unnamed: 0,id,post_date,post_title,post_content
0,1,2006-12-24 22:10:49,Washington to Delhi: Preparations Underway,"<div style=""text-align: center""><img alt=""The ..."
1,2,2006-12-24 22:10:49,About,"This is an example of a WordPress page, you co..."
2,3,2007-01-10 01:01:13,DC to London to Delhi,"I don't have much time, but I wanted to write ..."
3,4,2007-01-25 12:28:55,A Place to Start,Ive been here on the ground for just over two...
4,5,2007-01-30 09:06:36,Changing Times...,I finished the last post talking about what ha...
5,6,2007-01-30 09:54:15,Ashoka in the New York Times!!,I remember a lot of you wondering what work I ...
6,7,2007-02-05 15:11:36,Super Bowl XLI over Coffee in Chanakyapuri,I met up with some friends after work on Frida...
7,8,2007-02-06 12:16:02,Encounter of a Poetic Nature,I had a chance to see a poetry reading by <a t...
8,9,2007-02-09 15:49:34,CR Park in Delhi,
9,10,2007-02-09 16:12:37,Kali Mandir CR Park,


### B. Plan of Approach

This is what I'm thinking: 

* loop through the list of posts
* copy the content into a dictionary - on second thought this is probably not necessary; it can just stay in the dataframe, since a dictionary adds no value
* add the YAML header at the top of each post, which looks like the below; basically only the title and date matter
    * the title will need to be condensed a bit
    * the date is needed as well
* save each file with a name of the form date-title.md in some folder


```
---
layout: post
title: "Mother is Supreme"
date: 2003-09-03
categories: [college]
---

```
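
The steps above can be sketched end to end without the database. This is only a rough outline under stated assumptions: the `posts` list is made-up sample data standing in for rows of `wp_posts`, and `front_matter` and `export` are hypothetical helper names, not the functions written later in this post.

```python
import tempfile
from pathlib import Path

# Made-up stand-ins for rows of wp_posts (date already reduced to a string).
posts = [
    {"date": "2007-01-10", "title": "DC to London to Delhi", "body": "First post body."},
    {"date": "2007-01-25", "title": "A Place to Start", "body": "Second post body."},
]

def front_matter(title, date):
    """Return the YAML header that goes at the top of each file."""
    return (
        "---\n"
        "layout: post\n"
        f'title: "{title}"\n'
        f"date: {date}\n"
        "---\n\n"
    )

def export(posts, out_dir):
    """Write one date-title.md file per post and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for post in posts:
        short_title = post["title"].replace(" ", "")  # condensed title
        path = out_dir / f"{post['date']}-{short_title}.md"
        text = front_matter(post["title"], post["date"]) + post["body"]
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written

# Write into a throwaway folder so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    paths = export(posts, tmp)
    names = [p.name for p in paths]
    first_line = paths[0].read_text(encoding="utf-8").splitlines()[0]
```

Here `names` holds the generated file names and `first_line` is the opening `---` of the first file.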

### C. Preliminaries: Testing the pieces

#### 1. What does the post content look like?

In [17]:

q = 'select post_content from wp_posts where ID = 1;'

result = run_query(q)

result.post_content[0]



'<div style="text-align: center"><img alt="The White House" title="The White House" src="http://www.barton.edu/school-dept/history&ss/White%20House.jpeg" /></div>\r\n<div style="text-align: center"><strong>From the White House...</strong></div>\r\nLeaving the country for six months is -- as it turns out -- not that easy to do; its something of a logistical nightmare. Where will your mail go? How will I get money over there? What should I do with my car insuance? These are a few of the never ending number of things that I\'m trying to figure out and handle. And of course, getting business done during Christmas break is almost futile because like me, nobody is at work. Anyway, suffice it to say that I have 14 days left before I leave for Delhi, and there is much to do.\r\n\r\nAll of this stuff to do before Delhi has left me little time to think of how it will be to actually be in Delhi. I\'ve seen photos of Delhi so I think I\'m somewhat prepared for what it will look like at lease. And 

#### 2. The meta content

In [7]:
testyaml = '''
---
layout: post\r
title: "Mother is Supreme"\r
date: 2003-09-03\r
categories: [college]\r
---\n
'''

print(testyaml)


---
layout: post
title: "Mother is Supreme"
date: 2003-09-03
categories: [college]
---




#### 3. Converting to Markdown

In [19]:
print(html2text.html2text("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))

Hello, [world](http://earth.google.com/)!




In [44]:
print(st)


---
layout: post
title: "Mother is Supreme"
date: 2003-09-03
categories: [college]
---

![The White House](http://www.barton.edu/school-
dept/history&ss/White%20House.jpeg)

**From the White House...**

Leaving the country for six months is -- as it turns out -- not that easy to
do; its something of a logistical nightmare. Where will your mail go? How will
I get money over there? What should I do with my car insuance? These are a few
of the never ending number of things that I'm trying to figure out and handle.
And of course, getting business done during Christmas break is almost futile
because like me, nobody is at work. Anyway, suffice it to say that I have 14
days left before I leave for Delhi, and there is much to do. All of this stuff
to do before Delhi has left me little time to think of how it will be to
actually be in Delhi. I've seen photos of Delhi so I think I'm somewhat
prepared for what it will look like at lease. And I'm almost done reading this
book called a City of
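
In the output above the image URL is split across two lines (`school-` and `dept`), presumably because html2text hard-wraps its output at a fixed width by default. A minimal sketch of switching the wrapping off, assuming the installed html2text version exposes the `body_width` setting on `HTML2Text` (with 0 meaning no wrapping); `to_markdown_unwrapped` is a name made up for this example.

```python
def to_markdown_unwrapped(html):
    """Convert HTML to Markdown without hard-wrapping long lines."""
    # Third-party package; imported inside the function so that this
    # sketch can still be loaded on a machine where it is not installed.
    import html2text

    converter = html2text.HTML2Text()
    converter.body_width = 0  # assumption: 0 disables line wrapping
    return converter.handle(html)
```

Calling `to_markdown_unwrapped(result.post_content[0])` in place of `html2text.html2text(...)` would be the intended use, though I have not rerun the export this way.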

#### 4. Writing String to a File

In [42]:
# combine the front matter with the converted post and write it to test.md
st = testyaml + html2text.html2text(result.post_content[0])
with open("test.md", "w") as f:
    f.write(st)

### D. Write the Functions

#### Extract the Date
`post_date` is a datetime, and I just want the date part; it gets turned into a string later with `str()`.

In [10]:
df['justDate'] = df['post_date'].dt.date

In [11]:
df.sample(5)

Unnamed: 0,id,post_date,post_title,post_content,justDate
11,12,2007-02-12 23:06:23,"""My Cave""",I must admit that I had little idea about what...,2007-02-12
2,3,2007-01-10 01:01:13,DC to London to Delhi,"I don't have much time, but I wanted to write ...",2007-01-10
14,16,2007-02-28 20:55:35,Delhi Lessons Part I: The Kitna Hoga Tango,I've now written in some details about much of...,2007-02-28
6,7,2007-02-05 15:11:36,Super Bowl XLI over Coffee in Chanakyapuri,I met up with some friends after work on Frida...,2007-02-05
7,8,2007-02-06 12:16:02,Encounter of a Poetic Nature,I had a chance to see a poetry reading by <a t...,2007-02-06
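
Note that `.dt.date` yields date objects rather than strings, and a missing timestamp would carry through to anything built from it. A small sketch of an explicit formatter using only the standard library; `date_string` and the `undated` fallback are hypothetical choices, not part of the original notebook.

```python
from datetime import datetime

def date_string(ts, fallback="undated"):
    """Return YYYY-MM-DD for a datetime, or the fallback when it is missing."""
    # None is checked directly; NaN-like values are not equal to themselves.
    if ts is None or ts != ts:
        return fallback
    return ts.strftime("%Y-%m-%d")

formatted = date_string(datetime(2007, 2, 12, 23, 6, 23))
missing = date_string(None)
```

With the dataframe above, something like `df['post_date'].apply(date_string)` would be the equivalent call, though that is untested here.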


#### Extract the Short Title
The titles are too long in some cases, so the file names need a shorter version.

In [53]:
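def getFileName(row):
    x = row.post_title
    d = row.justDate

    maxt = 25  # maximum number of title characters to consider
    # strip punctuation so it does not end up in the file name
    exclude = set(string.punctuation)
    x = ''.join(ch for ch in x if ch not in exclude)
    if len(x) < maxt:
        x = x.replace(" ", "")
    else:
        # cut at maxt+1 characters, then drop the last (possibly partial) word
        x = ''.join(x[:maxt+1].split(' ')[0:-1])
    return str(d) + '-' + x + '.md'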

In [54]:
df['fileName'] = df.apply(getFileName, axis = 1)

In [55]:
df.sample(5)

Unnamed: 0,id,post_date,post_title,post_content,justDate,fileName
7,8,2007-02-06 12:16:02,Encounter of a Poetic Nature,I had a chance to see a poetry reading by <a t...,2007-02-06,2007-02-06-EncounterofaPoetic.md
5,6,2007-01-30 09:54:15,Ashoka in the New York Times!!,I remember a lot of you wondering what work I ...,2007-01-30,2007-01-30-AshokaintheNewYork.md
14,16,2007-02-28 20:55:35,Delhi Lessons Part I: The Kitna Hoga Tango,I've now written in some details about much of...,2007-02-28,2007-02-28-DelhiLessonsPartIThe.md
2,3,2007-01-10 01:01:13,DC to London to Delhi,"I don't have much time, but I wanted to write ...",2007-01-10,2007-01-10-DCtoLondontoDelhi.md
1,2,2006-12-24 22:10:49,About,"This is an example of a WordPress page, you co...",2006-12-24,2006-12-24-About.md


In [24]:
import string
s = "string. With. Punctuation?" # Sample string 
exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)
print(s)

string With Punctuation
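
The same punctuation stripping can also be done with `str.translate`, and truncating on a word boundary reads more easily when the title is split into words first. A sketch of a hyphenated variant; `short_slug` and its 25-character limit are illustrative choices, and it deliberately produces different names than `getFileName` above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def short_slug(title, max_len=25):
    """Lowercase, punctuation-free, hyphen-joined title cut at a word boundary."""
    words = title.translate(_STRIP_PUNCT).lower().split()
    kept = []
    for word in words:
        candidate = "-".join(kept + [word])
        # Always keep the first word; stop once adding a word exceeds the limit.
        if kept and len(candidate) > max_len:
            break
        kept.append(word)
    return "-".join(kept)

slug = short_slug("Encounter of a Poetic Nature")
```

Keeping the hyphens makes the words easier to tell apart in a URL, at the cost of a few characters of the budget.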


#### Construct YAML

In [65]:
def buildYAML(row):
    d = str(row.justDate)
    t = row.post_title
    
    s = '''
    ---
    layout: post\r
    title: "{}"\r
    date: {}\r
    categories: [delhiblog2007]\r
    ---\n
    '''.format(t,d)
    return s

In [66]:
df['yaml'] = df.apply(buildYAML, axis = 1)

In [67]:
df.sample(5)

Unnamed: 0,id,post_date,post_title,post_content,justDate,fileName,yaml
0,1,2006-12-24 22:10:49,Washington to Delhi: Preparations Underway,"<div style=""text-align: center""><img alt=""The ...",2006-12-24,2006-12-24-WashingtontoDelhi.md,"\n ---\n layout: post\r\n title: ""Was..."
3,4,2007-01-25 12:28:55,A Place to Start,Ive been here on the ground for just over two...,2007-01-25,2007-01-25-APlacetoStart.md,"\n ---\n layout: post\r\n title: ""A P..."
11,12,2007-02-12 23:06:23,"""My Cave""",I must admit that I had little idea about what...,2007-02-12,2007-02-12-MyCave.md,"\n ---\n layout: post\r\n title: """"My..."
19,21,2007-04-22 21:29:53,Life in Delhi Lesson III: Thou Must Be Aggress...,"<p align=""center""><img alt=""Everybody needs to...",2007-04-22,2007-04-22-LifeinDelhiLessonIII.md,"\n ---\n layout: post\r\n title: ""Lif..."
9,10,2007-02-09 16:12:37,Kali Mandir CR Park,,2007-02-09,2007-02-09-KaliMandirCRPark.md,"\n ---\n layout: post\r\n title: ""Kal..."
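
The `yaml` column above shows two things worth checking before publishing: the indentation inside the triple-quoted string is carried into every line of the header, and a title that already contains double quotes (such as `"My Cave"`) ends up as `""My Cave""`. A hedged sketch of one way around both, building the header from a list of lines and escaping the quotes; `build_front_matter` is a hypothetical name, and whether the leading whitespace actually causes trouble depends on the site generator in use.

```python
def build_front_matter(title, date, category="delhiblog2007"):
    """Return a YAML header with no indentation and an escaped title."""
    # Inside a double-quoted YAML string, backslashes and double quotes
    # need a backslash in front of them.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "---",
        "layout: post",
        f'title: "{safe_title}"',
        f"date: {date}",
        f"categories: [{category}]",
        "---",
        "",
    ]
    # Plain \n line endings; the header starts on the very first line.
    return "\n".join(lines) + "\n"

header = build_front_matter('"My Cave"', "2007-02-12")
```

In `buildYAML` this would correspond to calling `build_front_matter(row.post_title, str(row.justDate))`.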


#### Loop through dataframe, generate markdown, and write to file

In [70]:
def getMdContent(row):
    yaml = row.yaml
    content = row.post_content
        
    mdcontent = html2text.html2text(content)
    mdcontent = yaml + mdcontent
    
    return mdcontent

In [71]:
df['md_content'] = df.apply(getMdContent, axis = 1)

df.sample(5)

Unnamed: 0,id,post_date,post_title,post_content,justDate,fileName,yaml,md_content
12,15,2007-02-27 23:24:22,Cows and Camels and Elephants Oh My! (The Street),"<p align=""center""><img title=""Yes an elephant!...",2007-02-27,2007-02-27-CowsandCamelsand.md,"\n ---\n layout: post\r\n title: ""Cow...","\n ---\n layout: post\r\n title: ""Cow..."
1,2,2006-12-24 22:10:49,About,"This is an example of a WordPress page, you co...",2006-12-24,2006-12-24-About.md,"\n ---\n layout: post\r\n title: ""Abo...","\n ---\n layout: post\r\n title: ""Abo..."
8,9,2007-02-09 15:49:34,CR Park in Delhi,,2007-02-09,2007-02-09-CRParkinDelhi.md,"\n ---\n layout: post\r\n title: ""CR ...","\n ---\n layout: post\r\n title: ""CR ..."
11,12,2007-02-12 23:06:23,"""My Cave""",I must admit that I had little idea about what...,2007-02-12,2007-02-12-MyCave.md,"\n ---\n layout: post\r\n title: """"My...","\n ---\n layout: post\r\n title: """"My..."
6,7,2007-02-05 15:11:36,Super Bowl XLI over Coffee in Chanakyapuri,I met up with some friends after work on Frida...,2007-02-05,2007-02-05-SuperBowlXLIover.md,"\n ---\n layout: post\r\n title: ""Sup...","\n ---\n layout: post\r\n title: ""Sup..."


In [78]:
def writeBlogs(row):
    filename = row.fileName
    content = row.md_content

    try:
        # write the markdown for this post to its own file
        with open(filename, "w") as f:
            f.write(content)
        status = 'success'
    except Exception:
        # any problem while writing is recorded rather than raised
        status = 'fail'

    return status

In [79]:
df['writeStatus'] = df.apply(writeBlogs, axis=1)
df.sample(5)

Unnamed: 0,id,post_date,post_title,post_content,justDate,fileName,yaml,md_content,writeStatus
7,8,2007-02-06 12:16:02,Encounter of a Poetic Nature,I had a chance to see a poetry reading by <a t...,2007-02-06,2007-02-06-EncounterofaPoetic.md,"\n ---\n layout: post\r\n title: ""Enc...","\n ---\n layout: post\r\n title: ""Enc...",success
20,22,NaT,Travels,I've managed to squeeze in a fair amount of tr...,,nan-Travels.md,"\n ---\n layout: post\r\n title: ""Tra...","\n ---\n layout: post\r\n title: ""Tra...",success
8,9,2007-02-09 15:49:34,CR Park in Delhi,,2007-02-09,2007-02-09-CRParkinDelhi.md,"\n ---\n layout: post\r\n title: ""CR ...","\n ---\n layout: post\r\n title: ""CR ...",success
6,7,2007-02-05 15:11:36,Super Bowl XLI over Coffee in Chanakyapuri,I met up with some friends after work on Frida...,2007-02-05,2007-02-05-SuperBowlXLIover.md,"\n ---\n layout: post\r\n title: ""Sup...","\n ---\n layout: post\r\n title: ""Sup...",success
11,12,2007-02-12 23:06:23,"""My Cave""",I must admit that I had little idea about what...,2007-02-12,2007-02-12-MyCave.md,"\n ---\n layout: post\r\n title: """"My...","\n ---\n layout: post\r\n title: """"My...",success
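
Every sampled row above reports `success`, including the one written as `nan-Travels.md`, so the status column on its own says little about whether a file is usable. A sketch of a writer that puts files in a dedicated folder, sets the encoding explicitly, and records the error message on failure; `write_post` and the `_posts` folder name are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def write_post(out_dir, filename, content):
    """Write one post; return 'success' or a short failure description."""
    try:
        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / filename).write_text(content, encoding="utf-8")
    except OSError as err:
        return f"fail: {err}"
    return "success"

with tempfile.TemporaryDirectory() as tmp:
    ok = write_post(Path(tmp) / "_posts", "2007-02-06-EncounterofaPoetic.md", "---\n")
    # The parent folder of this path does not exist, so the write fails.
    bad = write_post(tmp, "missing-folder/post.md", "---\n")
```

Keeping the message in the status makes it possible to see afterwards why a particular post did not get written.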
