### Retrieve the data

In [0]:
import urllib.request  # importing only `urllib` does not guarantee that the `request` submodule is loaded

In [0]:
dataset_link = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
dataset_storage_location = '/tmp/aclimdb.tar.gz'
output_dir = '/tmp/aclimdb'  # not used below: the archive is extracted into the working directory

In [0]:
urllib.request.urlretrieve(dataset_link, dataset_storage_location)

Out[3]: ('/tmp/aclimdb.tar.gz', <http.client.HTTPMessage at 0x7fb437717520>)

In [0]:
import os
os.path.exists(dataset_storage_location)

Out[4]: True
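
`os.path.exists` only confirms that some file was written. A slightly stronger check, sketched here, also asks `tarfile` whether the file can be opened as an archive. The helper name `looks_like_tar_archive` is hypothetical and not part of the original notebook.

```python
import os
import tarfile


def looks_like_tar_archive(path):
    """Return True if `path` is a non-empty file that tarfile can open."""
    # Hypothetical helper, added for illustration only.
    return (
        os.path.isfile(path)
        and os.path.getsize(path) > 0
        and tarfile.is_tarfile(path)
    )
```

With the variables defined above, it would be called as `looks_like_tar_archive(dataset_storage_location)`.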

### Unpack it to the local folder

In [0]:
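import tarfile

# Extract into the current working directory; the archive's top-level folder is aclImdb/.
# The context manager closes the archive even if extraction fails.
with tarfile.open(dataset_storage_location) as tar:
    tar.extractall()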
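
The cell above extracts into the working directory, so `output_dir` is never used. A minimal sketch of extracting into an explicit directory follows. `extract_archive` is a hypothetical helper, and the `filter='data'` argument is applied only when the running Python version supports extraction filters, which `hasattr(tarfile, 'data_filter')` detects.

```python
import os
import tarfile


def extract_archive(archive_path, target_dir):
    """Extract `archive_path` into `target_dir`; return the sorted top-level entries."""
    # Hypothetical helper, added for illustration only.
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, 'data_filter'):
            # Extraction filters reject unsafe members such as absolute paths.
            tar.extractall(path=target_dir, filter='data')
        else:
            tar.extractall(path=target_dir)
    return sorted(os.listdir(target_dir))
```

If this variant were used with `output_dir`, the later cells would have to read from `/tmp/aclimdb/aclImdb` instead of `./aclImdb`.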

In [0]:
sorted(os.listdir('./aclImdb/'))

Out[6]: ['README', 'imdb.vocab', 'imdbEr.txt', 'test', 'train']

### Understand what is inside the data folder

In [0]:
sorted(os.listdir('./aclImdb/train'))

Out[7]: ['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [0]:
sorted(os.listdir('./aclImdb/train/pos'))

Out[8]: ['0_9.txt',
 '10000_8.txt',
 '10001_10.txt',
 '10002_7.txt',
 '10003_8.txt',
 '10004_8.txt',
 '10005_7.txt',
 '10006_7.txt',
 '10007_7.txt',
 '10008_7.txt',
 '10009_9.txt',
 '1000_8.txt',
 '10010_7.txt',
 '10011_9.txt',
 '10012_8.txt',
 '10013_7.txt',
 '10014_8.txt',
 '10015_8.txt',
 '10016_8.txt',
 '10017_9.txt',
 '10018_8.txt',
 '10019_8.txt',
 '1001_8.txt',
 '10020_8.txt',
 '10021_8.txt',
 '10022_7.txt',
 '10023_9.txt',
 '10024_9.txt',
 '10025_9.txt',
 '10026_7.txt',
 '10027_7.txt',
 '10028_10.txt',
 '10029_10.txt',
 '1002_7.txt',
 '10030_10.txt',
 '10031_10.txt',
 '10032_10.txt',
 '10033_10.txt',
 '10034_8.txt',
 '10035_9.txt',
 '10036_8.txt',
 '10037_9.txt',
 '10038_10.txt',
 '10039_10.txt',
 '1003_10.txt',
 '10040_10.txt',
 '10041_10.txt',
 '10042_10.txt',
 '10043_10.txt',
 '10044_9.txt',
 '10045_10.txt',
 '10046_9.txt',
 '10047_10.txt',
 '10048_10.txt',
 '10049_8.txt',
 '1004_7.txt',
 '10050_10.txt',
 '10051_10.txt',
 '10052_10.txt',
 '10053_8.txt',
 '10054_10.txt',
 '10
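
The file names in the listing follow an `<id>_<rating>.txt` pattern, for example `10001_10.txt` for review 10001 with a rating of 10. Assuming every file follows that pattern, the rating can be recovered from the name alone. `parse_review_filename` is a hypothetical helper, not part of the original notebook.

```python
import os


def parse_review_filename(filename):
    """Split a name such as '10001_10.txt' into (review_id, rating) as integers."""
    # Assumes the '<id>_<rating>.txt' pattern seen in the listing.
    stem, _extension = os.path.splitext(os.path.basename(filename))
    review_id, rating = stem.split('_')
    return int(review_id), int(rating)
```

The helper accepts either a bare file name or a full path, since it keeps only the base name before splitting.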

In [0]:
with open('./aclImdb/train/pos/0_9.txt', encoding='utf-8') as f:
    print(f.readlines())

['Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!']


### Read the data into a dataframe

In [0]:
import glob

top_level_directories = ['train', 'test']
classifications = ['pos', 'neg']

starting_directory = './aclImdb'


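# Collect one record per review: its text, its label (pos/neg), and its split (train/test).
data = []
for tld in top_level_directories:
    for classification in classifications:
        for entry in glob.glob(f'{starting_directory}/{tld}/{classification}/*.txt'):
            # Read as UTF-8 explicitly instead of relying on the platform default encoding.
            with open(entry, encoding='utf-8') as f:
                text = ' '.join(f.readlines())
            data.append({
                'text': text,
                'classification': classification,
                'type': tld
            })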
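
Before handing `data` to Spark, it can help to confirm that every split and label was picked up. The sketch below counts records per `(type, classification)` pair. `count_by_split_and_label` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter


def count_by_split_and_label(records):
    """Count records per (type, classification) pair."""
    # Expects dicts shaped like the ones appended to `data` above.
    return Counter((r['type'], r['classification']) for r in records)
```

Calling `count_by_split_and_label(data)` should report four keys. If the full archive was extracted, each is expected to hold 12,500 reviews, since the dataset is split evenly between labels within `train` and `test`.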

In [0]:
sdf = spark.createDataFrame(data)
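
`spark.createDataFrame(data)` infers the schema from the dictionaries. One alternative, sketched here, is to convert the records to tuples in a fixed column order and pass the column names explicitly. `records_to_rows` is a hypothetical helper, and the column order is an assumption chosen to match the displayed output.

```python
def records_to_rows(records, columns):
    """Convert dicts to tuples whose fields follow the order given in `columns`."""
    # Hypothetical helper: raises KeyError if a record lacks one of the columns.
    return [tuple(record[column] for column in columns) for record in records]
```

With `columns = ['classification', 'text', 'type']`, the DataFrame could then be built as `spark.createDataFrame(records_to_rows(data, columns), schema=columns)`, assuming `spark` is the session provided by the notebook environment.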

In [0]:
display(sdf)

classification,text,type
pos,"There have been countless talking-animal films in the past, the majority of which either feature animals' mouths digitally animated to nearly match the voice acting, or are ridiculously amateur. 'Homeward Bound: The Incredible Journey' is neither. This film doesn't need the infant-pleasing addition of moving canine lips, or gesturing feline limbs. It has the ability to make you believe that the animals are authentically talking to one another, and you can get rather emotionally attached to them at heart (as all great boy-and-and-his-dog films should). Homeward Bound is the epitome of all family-friendly animal romps to me, and boasts some beautiful cinematography, an inspiring soundtrack (from the genius of Bruce Broughton), and an impressive cast... Michael J. Fox ... Chance Sally Field ... Sassy Don Ameche ... Shadow Frank Welker (Voice God) ... Various It is a modernised version of the children's classic work of fiction 'The Incredible Journey', which was made into a semi-documentary film by Disney long long ago in 1963. The sequel (Lost in San Fransisco) isn't nearly as good a film, but extends the adventure of my favourite furry-footed friends, and is a fun urban-twist on the grand-outdoor-adventure theme. Want to entertain your children with a witty, pretty, heart-warming mini-epic, without the idiotic and often utterly ridiculous comedy of modern children's cinema? Parents, buy all three films for your children - now! Thank you, Disney, for bringing a tear to my eyes with each time I watch this early-90s classic!",train
pos,"Tintin and I recently aired as an episode of PBS's P.O.V. series. It's based on a taped interview of Georges Remi a.k.a. Herge, Tintin's creator, from 1971 in which in discusses his various experiences publishing his popular character, first in a Catholic newspaper, then in his own series of comic books. Awesome sweeping views of various comic pages and surreal images of Herge's dreams. I first encountered Tintin in the pages of Children's Digest at my local elementary school library reading The Secrets of the Unicorn. My mom later got a subscription to CD and I read the entire Red Rackham's Treasure every month in 1978. I remember seeing some Tintin comic books in a local book store after that but for some reason I didn't get any probably because I was 12 and I thought I was outgrowing them. I do have Breaking Free, a book written and drawn by J. Daniels, published in 1989, six years after Herge's death. Haven't read it yet. This film also covers the artist's personal life as when he left his first wife after his affair with a colorist in his employ (whom he later married). Her name is Fanny and she is interviewed here. If you love Tintin and his creator, this film is definitely worth a look. Update: 9/4/07-I've now read Breaking Free. Tintin and The Captain are the only regular characters that appear here and they are tailored to the anti-capitalist views of Mr. Daniels with Tintin portrayed as a rabble rouser with a chip on his shoulder who nevertheless cares for The Captain who he's staying with. The Captain here is just trying to make ends meet with a wife and daughter that he loves dearly. They and other construction workers vow to strike after a fellow employee dies from a faulty equipment accident. The whole thing takes place in England with working-class cockney accents intact. Not the kind of thing Herge would approve of but an interesting read nonetheless. 
Oh, yes, dog Snowy only appears in the top left corner of the cover (which has Tintin running over the police!) and the dedication page.",train
pos,"I saw this movie the day it came out last year. Hilarious I thought. Well, now it's on video and I saw it again. I love this movie! The things they do are sometimes dumb but that's what makes it my third favorite movie of all time. The special effects are okay, but the witty dialog will have you rolling. I'm the kind of person that'll say i'm inspired by this movie, so if you like dramas and other stuff, avoid. But for all others, enjoy! The acting is superb. Hank Azaria is hands down the best (he's neither a commie, nor a fruit) followed by Ben Stiller (uh, don't correct me. it sickens me) and then William H. Macy delivering his best performance (outshining fargo) Everybody has praised everyone from macy to garafalo, but I think Kel Mitchell was pretty good as Invisible Boy. Two problems: The most boring part of the film is the subplot of the romance between Stiller and Claire Forlani, and the Casanova parole hearing. Some scenes absolutely advance the story in no way, but they're a blast. Kinka and especially the writers tend to drag on a scene untill all it's hilarity is gone, but bam they switch and you're ready for more. I swear after seeing this, you will be tired from the explosive climax (which I think was pretty cool) The camera is pretty cool also, moving at a furious pace with the actors. Also, Tom Waits delivers an outstanding performance (he has this kinda cool bad hero coolness to him) and like someone else said, the best parts are when the characters show some humanness to them. Captain Amazing is pretty funny, (especially his speech to Casanova about his perfect plan-I was rolling) and rush is pretty cool as Casanova. One beef: the funniest comedian ever (eddie izzard) is almost wasted, but his heart is in the right place. So all in all, a wonderful movie. I give it twenty stars and hope that someday, everyone will see the brilliance in the film's best parody, the Six Million Dollar Man one. Laughing right now as I think about it. 20/10",train
pos,"Simon Pegg plays the part of Sidney Young, a young entertainment writer who has begun the beginnings of a career writing for a grassroots magazine that specializes in badmouthing the shallowness and superficiality of the rich and famous. He is making a career out of lampooning celebrities, although he has a desperate wish to be a celebrity himself. The movie is based on the very bizarre career of Toby Young, who also ran a small magazine in Britain called the Modern Review, which offered scathing criticism of pretty much everything imaginable, until he closed the magazine in a hail of verbal bullets with his co-editor, and then went on to a spectacularly failed career as a writer for Vanity Fair, which is pretty much the part of his life told in this movie. He is at first thrilled to go work for a major publication (called Sharp's Magazine in the movie), and despite active nerves he is positively beaming on his first day. He meets the chief editor, Clayton Harding (played by Jeff Bridges), who is hard as nails but who is also exactly the kind of editor he needs to be for a goof-off like Young to keep his job at the magazine. He offers little in the form of immediate acceptance of Young, but he also has what can only be described as a liberal tolerance of Young's off-the-wall antics and inappropriate behavior. Much of the comedy in the movie is derived from Young's misunderstanding of or indifference to the generally accepted code of public behavior and the peculiar etiquette involved in dealing with the rich and famous. But Sidney's reasons for acting in such a weird way and for giving outwardly offensive interviews is because he believes that he loathes the entire celebrity culture and, it would seem, he believes in that age-old saying  'If you can't beat 'em, join 'emand THEN beat 'em."" Complicating matters are two very different women. 
There is a charming, regular girl at the magazine named Alison Olsen (Kirsten Dunst) who at first is appalled by Sidney's obvious arrogance and womanizing ways, and a stunning model named Sophie (Megan Fox), who represents the celebrity culture. Needless to say, Sidney's endless attack of superficiality and stardom is a superficial lust for Sophie, the one with the look of a star. Sophie is stunningly beautiful, it's true, but also comes across as having not a single thought rattling around in her head. Alison is a regular girl, not very interesting or attractive, but Dunst's performance makes her a real person. A relationship with her would have all the reality of a Britney Spears marriage, and yet the movie retains some level of believability because, despite how obvious this is, we also feel Sidney's pain in not pursuing her (I felt it, anyway). How To Lose Friends and Alienate People has a pretty interesting premise and is full of honest, satisfactory performances, and although it turns into a bit of your standard romantic comedy by the third act, it has a variety of well-developed and interesting characters. Danny Huston, for example, gives us a great performance as Alison's other love interest, who pays homage to The Big Lebowski (also starring Bridges) with his ever-present White Russian, one of my personal favorite drinks. Buying Absolute and Kahlua here in China costs the equivalent of about $350, but my kitchen is never without them. I am looking forward to the day when Simon Pegg will branch out a little bit, because I love his films but I am completely unsure about his range. He played a serious character in Hot Fuzz, but only serious in relation to the lunacy surrounding him, and ultimately went back to being himself again, which he has pretty much been in Shaun of the Dead, Run, Fat Boy, Run, and now How To Lose Friends. He's a rising star, it will be interesting to see what else he can do.",train
pos,I found this very touching as Spike and Heaton stay together all the way through this film not to say there isn't a few betrayals along the way. I thought the chase was put aside the relationship between the two was foreground I think. I had already guessed that there were so gay intentions on the part of Heaton. My favourite scene had to be the bit where Heaton and Spike were stuck in the marsh and Spike runs off I generally thought Spike wasn't coming back. I have to say that if it wasn't for our film studies teacher making us watch this I would have probably never seen it. Overall I thought this film was pretty good and I would recommend it to any person who is a fan of British made films.,train
pos,"I saw an advanced screening for this movie tonight. I absolutely loved it. The movie kept me on the edge of my seat all night. Cillian Murphy is extremely creepy as the villain. For those of you who have seen Batman Begins, his character was much scarier in this film. He played his character very well. The scariest ""bad guy,"" I have seen in awhile. Rachel McAdams was great. Everyone in the audience laughed, gasped and cheered at the same time, as if we were on cue. The suspense is held through out the movie. THe amazing part is that the end was not anti-climatic. I was not disappointed in the end. I felt satisfied. The trailer does not do the movie justice. The movie is much better than the trailer indicated. Do not wait for this movie to come out on video. Go see it. Although, I did not have to pay to see this movie, I would have gladly given 10.75 to see it. Enjoy!",train
pos,"The young Dr. Fanshawe(Mark Letheren), an avid archaeologist, is dispatched by his Museum boss to the large country home of Squire Richards(Pip Torrens), where his task is to find provenance for and catalogue the collection of antiquities and curios belonging to the recently deceased father of the Squire. The Squire is surprised by the arrival Fanshawe, he hadn't been expecting him for another week, but none the less welcomes him and gets his only servant, Patten (David Burke..of Dr Watson fame), to show him to his room, as Fanshawe must stay over for some days in order to finish his rather large task. Patten it would seem is not the friendliest sort and seems to resent the extra work that Fanshawe's visit will entail, the large empty house providing an endless amount of cooking, cleaning and maintenance for him. Fanshawe is a fussy sort, very neat and precise with everything having its place, whether they be his clothes or his books and papers and he is rather disgusted by the dirt in his room. Needless to say he is rather eager to begin his work, but unpacking he finds his binoculars have been damaged in transit, so he asks the Squire for a replacement pair, The Squire who is a modern thinking man but also it would seem rather uncultured with such matters, is also eager to get rid of the clutter around the house, so he obliges and walks Fanshawe to the top of the hill so that he can survey the estate and the surrounding villages, there the Squire directs him to points of interest, including Gallows Hill, where locals were hung for their crimes and misdemeanours, his interest is also taken by a local abbey which the Squire describes as a ruin, but Fanshawe can see through the binoculars that it clearly isn't, he investigates further and pays a visit to the site of the abbey and is shocked to find that there are but a few stone remnants? 
Fanshawe doesn't have too much time to think about this conundrum as he darkness falls he feels he is being watched, he feels a presence, he begins to see moving shadows in the woods, startled he runs home. Over dinner he imparts details of his harrowing day to the Squire, Patten overhears the story and suggests an explanation for it..The Binoculars! they used to belong to a local man called Baxter, whom it would seem collected bones and skulls from Gallows Hill, boiling them up for some concoction or other, Baxter had disappeared mysteriously one night, the late Squire had acquired his belongings, including a mask made out of a skull and some old etchings of the area. These etchings fascinate Fanshawe as they portray the Abbey he seen through his binoculars, but he learns that the abbey had been destroyed during the reign of Henry VII and so it would be impossible for Baxter to have drawn the sketches, never the less they are signed and dated by Baxter to the recent past so he concludes that the binoculars have some special power. That night he has horrifically vivid dreams, when he wakes, he sets off with the binoculars to have a closer look at the abbey through them, what he finds surprises him but has he put himself in perilous danger by doing so? Fanshawe finally becomes trapped in his dangerous obsession, as darkness falls the Squire and a search party go in search of the now missing archaeologist, they are alerted by dozens of loudly cawing crows circling above Gallows HIll, they quicken their speed, but will they be in time to help or save Fanshawe from his destiny? The Ghost Story for Christmas series of films made by the BBC sadly ended its initial run of films in 1978 with The Ice House, they were for the most part based on the work of the great M.R. James. In 2005 and 2006 the series was revived briefly and thankfully A View from a Hill also marked a return to the work of James, whose ghostly writings have haunted many generations of readers. 
Director Luke Watson being new to the series might have worried fans of the older films, but he returns to the period setting abandoned by the later films which immediately sets the tone for a great Ghost story, his direction is assured as he stays true to the mood of the masters works and gradually builds up the fear factor to a terrifying climax, all the while keeping what the viewer sees to a minimum, thus upping the tension and mystery. The Autumn countryside provides oodles of atmosphere, the falling leaves and low lying sun providing an unsettling backdrop for the sinister events to come. The cast it must be said are all superb and are perfectly cast in their respective roles. The idea behind the binoculars is simple but very effective, the use of a man made object to see supernatural beings and events that the naked eye cannot see, may even have influenced Álex de la Iglesia in his film La habitación del niño (2006) of the following year, with which it bears striking similarity. I had heard mixed reviews of this particular film, but i must say i found it at all times intriguing and it even raised a few hairs on my head and gave me a few shivers, something that doesn't happen much these days, i think any negativity surrounding the film can only be attributed to its pacing, which to my eyes is perfection but to modern audiences it will be seen as deathly slow. Plenty of time is given, even within its brief 40 minutes running time, for character development and plot expansion and i must say its a new favourite of mine and certainly one of the better films of the decade.",train
pos,"I remember stumbling upon this special while channel-surfing in 1965. I had never heard of Barbra before. When the show was over, I thought ""This is probably the best thing on TV I will ever see in my life."" 42 years later, that has held true. There is still nothing so amazing, so honestly astonishing as the talent that was displayed here. You can talk about all the super-stars you want to, this is the most superlative of them all! You name it, she can do it. Comedy, pathos, sultry seduction, ballads, Barbra is truly a story-teller. Her ability to pull off anything she attempts is legendary. But this special was made in the beginning, and helped to create the legend that she quickly became. In spite of rising so far in such a short time, she has fulfilled the promise, revealing more of her talents as she went along. But they are all here from the very beginning. You will not be disappointed in viewing this.",train
pos,"Fully deserving its prestigious Hollywood award nomination, this is an entertaining little gem with lots of pizazz and some delightful surprises. Outstandingly funny scenes include an hilarious shoot (and re-shoot) of a WW1 trench scene with Australian comedian Clyde Cook as an optimistic non-com and the hapless McDoakes as a Boyer/Colman messenger  all under the beady eye of Ralph Sanford's delightfully irascible Anguish; a lost McDoakes guided and re-guided by equally perplexed Jack Carson; assistant director Chandler rejoicing in a McDoakes-sent opportunity: ""I'm going to be a director!"" Ace comic O'Hanlon has a dual role, playing both McDoakes and himself playing McDoakes! Oddly, Richard L. Bare who does play himself in one or more other entries in the series, has turned down that opportunity here. In real life, Bare's a youngish, six-foot Rock Hudson lookalike, but here he's impersonated by veteran actor (over 500 movies!), Jack Mower.",train
pos,"What a fascinating film. Even if it wasn't based on real life, Forbidden Lies was a fascinating portrait of a con artist in her element. And it is the kind of film psychology students could study to learn about compulsive liars. The author of Forbidden Love, Norma, was revealed as a fraud in the media but this move really does give her ample opportunity to clear her name. But the twists and turns she takes the documentary maker through are amazing. What a patient woman! I loved this movie. I have not read the book but simply heard good reviews and went to see it on boring rainy afternoon. The journey this film takes you on is clever, interesting and totally engrossing.",train


In [0]:
sdf.write.mode("overwrite").saveAsTable("default.aclImdb_data_in")