The head of marketing at a fashion company, XYZ, has a theory that male customers are more likely than female customers to engage with the company's product on social media. They have asked you (a data scientist) to write an algorithm that predicts the gender of Twitter users mentioning the product, based on the text of their posts. The marketing head has provided a list of tweet IDs for each customer. Your script has to turn each list of IDs into both a score representing how strongly the company believes the customer to be of a given gender and a prediction of that gender.

The idea is to first fetch the tweets for the given IDs, then extract their text and apply sentiment analysis to predict each user's gender along with a gender score. This will enable the company to check its hypothesis (informally, not as a statistical hypothesis test).

I am not going to show any fancy sentiment analysis algorithm here. Instead, I will create a small lexicon manually (I know this is terrible) and use it for making predictions. The objective here is to show how parallel programming can be used in a real-life scenario.

If you are following along, you will need a **Twitter Developer Account**, along with the following Python libraries:
- `python-twitter`
- `toolz`
- `multiprocessing`

`python-twitter` and `toolz` are pip installable; `multiprocessing` ships with Python's standard library, so it needs no installation.

## Setting up

In [2]:
# Import necessary packages
import twitter
from toolz import pipe
from multiprocessing import Pool

In [3]:
# Setup your Twitter API with your credentials
Twitter = twitter.Api(consumer_key="",
                      consumer_secret="",
                      access_token_key="",
                      access_token_secret="")

## Helper functions

In [4]:
# Retrieve the tweets represented by given ids
def get_tweet_from_id(tweet_id, api=Twitter):
    return api.GetStatus(tweet_id, trim_user=True)

# Extract the text from the tweets
def tweet_to_text(tweet):
    return tweet.text

# Split the tweet text on whitespace (no lowercasing or punctuation handling). Don't bash!
def tokenize_text(text):
    return text.split()

# Score a list of tokens with a tiny hand-made lexicon:
# +1 or -1 for listed words, 0 for everything else
def score_text(tokens):
    lexicon = {"the": 1, "to": 1, "and": 1,
               "in": 1, "have": 1, "it": 1,
               "be": -1, "of": -1, "a": -1,
               "that": -1, "i": -1, "for": -1}
    return sum(map(lambda x: lexicon.get(x, 0), tokens))
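
To see what the tokenizer and lexicon do without calling the Twitter API, here is a minimal offline sketch. The sample text is made up for illustration, and the lexicon is copied into a module-level `LEXICON` constant (my naming, not part of the original cells). It also shows that matching is case-sensitive: the capital "I" is not found in the lexicon unless the text is lowercased first.

```python
# Same word weights as the lexicon in score_text.
LEXICON = {"the": 1, "to": 1, "and": 1, "in": 1, "have": 1, "it": 1,
           "be": -1, "of": -1, "a": -1, "that": -1, "i": -1, "for": -1}


def tokenize_text(text):
    # Naive whitespace split, as in the helper above.
    return text.split()


def score_text(tokens):
    # Unknown tokens contribute 0.
    return sum(LEXICON.get(token, 0) for token in tokens)


# Hypothetical tweet text, invented for this example.
sample = "I love the fit of that jacket and the colour"

tokens = tokenize_text(sample)
raw_score = score_text(tokens)                            # "I" is not matched
lowered_score = score_text(tokenize_text(sample.lower()))  # "i" is matched

print(tokens)
print(raw_score, lowered_score)
```

Lowercasing changes the score here, so whether to normalize the text before scoring is a choice worth making deliberately.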

In [5]:
# Construct a function pipeline
def score_tweet(tweet_id):
    return pipe(tweet_id, get_tweet_from_id, tweet_to_text,
                tokenize_text, score_text)
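
If `pipe` is new to you: it threads a value through the functions from left to right, so `pipe(x, f, g)` is the same as `g(f(x))`. The sketch below checks this on a made-up string using built-in `str` methods rather than the Twitter helpers. The fallback definition is only a stand-in, used if `toolz` is not installed, and is not toolz's actual implementation.

```python
try:
    from toolz import pipe
except ImportError:
    from functools import reduce

    def pipe(data, *funcs):
        # Stand-in with the same left-to-right behaviour as toolz.pipe.
        return reduce(lambda value, func: func(value), funcs, data)


# Hypothetical text, invented for this example.
text = "  Nice jacket, love it  "

# Left to right: strip, then lowercase, then split.
piped = pipe(text, str.strip, str.lower, str.split)

# The equivalent nested calls read inside out.
nested = str.split(str.lower(str.strip(text)))

print(piped)
print(piped == nested)
```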

In [6]:
from toolz import compose

# Score one user: the mean score over that user's tweet IDs
# (assumes the list is not empty)
def score_user(tweets):
    N = len(tweets)
    total = sum(map(score_tweet, tweets))
    return total/N

# Get the gender: a positive score means "Male";
# zero or a negative score means "Female"
def categorize_user(user_score):
    if user_score > 0:
        return {"score": user_score,
                "gender": "Male"}
    return {"score": user_score,
            "gender": "Female"}
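
Here is a small sketch of the averaging and thresholding steps on their own. The per-tweet scores are hypothetical numbers, not values fetched from Twitter, and `average_score` is a stand-in for `score_user` that takes scores directly instead of tweet IDs. Note that `compose` applies its functions right to left, and that a score of exactly 0 is labelled "Female". The fallback `compose` is only used if `toolz` is not installed.

```python
try:
    from toolz import compose
except ImportError:
    def compose(*funcs):
        # Stand-in: applies the functions right to left, like toolz.compose.
        def composed(value):
            for func in reversed(funcs):
                value = func(value)
            return value
        return composed


def average_score(tweet_scores):
    # Hypothetical stand-in for score_user, fed with ready-made scores.
    return sum(tweet_scores) / len(tweet_scores)


def categorize_user(user_score):
    gender = "Male" if user_score > 0 else "Female"
    return {"score": user_score, "gender": gender}


# average_score runs first, then categorize_user.
pipeline = compose(categorize_user, average_score)

negative = pipeline([1, -2, 0, -1, 0])  # total -2 over 5 tweets
neutral = pipeline([1, -1, 0, 0, 0])    # total 0 over 5 tweets
positive = pipeline([2, 1, 0, 1, 0])    # total 4 over 5 tweets

print(negative, neutral, positive)
```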

## In action

In [8]:
users_tweets = [
    [1056365937547534341, 1056310126255034368, 1055985345341251584,
    1056585873989394432, 1056585871623966720],
    [1055986452612419584, 1056318330037002240, 1055957256162942977,
     1056585921154420736, 1056585896898805766],
    [1056240773572771841, 1056184836900175874, 1056367465477951490,
     1056585972765224960, 1056585968155684864],
    [1056452187897786368, 1056314736546115584, 1055172336062816258,
     1056585983175602176, 1056585980881207297]]

gender_pipeline = compose(categorize_user, score_user)

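# The __main__ guard keeps worker processes from re-running this block
# on platforms that start them by re-importing the script (e.g. Windows)
if __name__ == "__main__":
    with Pool() as P:
        print(P.map(gender_pipeline, users_tweets))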

[{'score': -0.4, 'gender': 'Female'}, {'score': 0.0, 'gender': 'Female'}, {'score': 0.8, 'gender': 'Male'}, {'score': -0.4, 'gender': 'Female'}]
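
If you want to try the parallel `map` pattern without Twitter credentials, the sketch below runs the same kind of pipeline on made-up tweet texts. Two things differ from the code above and are my assumptions: the users are given as lists of texts rather than tweet IDs, and I use `multiprocessing.pool.ThreadPool` (threads, same `map` interface) instead of `Pool` so the example runs anywhere, including inside a notebook. It also checks that the pooled `map` returns results in input order, matching a plain sequential `map`.

```python
from multiprocessing.pool import ThreadPool

# Same word weights as the lexicon used earlier.
LEXICON = {"the": 1, "to": 1, "and": 1, "in": 1, "have": 1, "it": 1,
           "be": -1, "of": -1, "a": -1, "that": -1, "i": -1, "for": -1}


def score_text(text):
    return sum(LEXICON.get(token, 0) for token in text.split())


def score_user(texts):
    # Mean score over one user's tweets.
    return sum(map(score_text, texts)) / len(texts)


def categorize_user(user_score):
    gender = "Male" if user_score > 0 else "Female"
    return {"score": user_score, "gender": gender}


def gender_pipeline(texts):
    return categorize_user(score_user(texts))


# Hypothetical tweet texts for two users, invented for this example.
users_texts = [
    ["the shoes have to be in stock", "and it fits"],
    ["that is a gift for me", "be kind"],
]

# Each user is handled by a worker; results come back in input order.
with ThreadPool(2) as pool:
    parallel = pool.map(gender_pipeline, users_texts)

sequential = list(map(gender_pipeline, users_texts))

print(parallel)
print(parallel == sequential)
```

Fetching tweets is mostly waiting on the network, so threads are a reasonable fit for that step; a process pool pays off more when the per-user work is CPU-heavy.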
