# 1. The “Friends and Family” data set

> An experiment was designed in 2011 to study (a) how people make decisions, with emphasis on the social aspects involved, and (b) how we can empower people to make better decisions using personal and social tools. The data set was collected by Nadav Aharony, Wei Pan, Cory Ip, Inas Khayal, and Alex Pentland.

More details are available at http://realitycommons.media.mit.edu/friendsdataset.html

> The subjects were members of a young-family residential living community adjacent to a major research university in North America. All members of the community are couples, and at least one of the members is affiliated with the university. The community is composed of over 400 residents, approximately half of whom have children. A pilot phase of 55 participants was launched in March 2010. In September 2010, phase two of the study included 130 participants, approximately 64 families. Participants were selected out of approximately 200 applicants in a way that would achieve a representative sample of the community and sub-communities.

In the ``data-fnf`` directory, we provide a data set with 129 users interacting with each other:

In [None]:
!ls data-fnf/records/

Each CSV file contains call and text metadata records belonging to a single user:

In [None]:
!head "data-fnf/records/fa10-01-01.csv"

# 2. Export indicators for every user

With bandicoot, it is easy to load all the users and automatically compute their indicators.

In [None]:
import bandicoot as bc
from tqdm.notebook import tqdm  # Interactive progress bar (tqdm_notebook is deprecated)
import glob
import os

import pandas
import numpy as np

In [None]:
# Load a user and return all of its indicators

def make_features(user_id):
    user = bc.read_csv(user_id, "data-fnf/records/",
                       attributes_path="data-fnf/attributes/",
                       describe=False, warnings=False)

    return bc.utils.all(user, summary='extended', split_day=True, split_week=True)

In [None]:
# Loop over all CSV files in data-fnf/records/ and call make_features on each user

all_features = []

for f in tqdm(glob.glob("data-fnf/records/*.csv")):
    user_id = os.path.splitext(os.path.basename(f))[0]  # File name without the .csv extension
    all_features.append(make_features(user_id))

# Export all features in one file (fnf_features.csv)
bc.io.to_csv(all_features, 'fnf_features.csv')

# 3. Gender classification

The data set provided contains both metadata records and gender for each user. Let's try to predict the gender from the indicators we computed.


In [None]:
# Load the features and attributes in a table, using the pandas library

df = pandas.read_csv('fnf_features.csv')
df.head()

We create two objects:

- the array ``y`` contains the labels we want to predict (male/female),
- the matrix ``X`` contains the features for all users (one column per feature, one row per user).

In [None]:
# 1. We convert gender labels to binary values (1 for male, 0 for female).
#    The np.int alias was removed in NumPy 1.24, so we use the built-in int.
y = (df.attributes__gender == 'male').values.astype(int)

y
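Before training anything, it is worth checking how balanced the two classes are, since the share of the majority class is the accuracy a trivial "always predict the same gender" classifier would reach. Here is a minimal sketch on a made-up label column; the values are illustrative and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical gender column standing in for df.attributes__gender
gender = pd.Series(['male', 'female', 'female', 'male', 'female'])

# Same encoding as above: 1 for 'male', 0 otherwise
y_toy = (gender == 'male').values.astype(int)

# Number of samples per class: index 0 = female, index 1 = male
counts = np.bincount(y_toy, minlength=2)

# Accuracy reached by always predicting the most frequent class
majority_share = counts.max() / counts.sum()

print(counts, majority_share)
```

Running the same two lines on the real ``y`` gives a reference point for the scores computed later.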

In [None]:
# 2. We remove columns with reporting variables and attributes (the first 39 and the last 2):
df = df[df.columns[39:-2]]
X = df.values

X
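Selecting columns by position works for this export but breaks silently if the column order changes. A possible alternative is to drop columns by name. The sketch below uses a toy table whose column names are hypothetical; we assume that reporting variables share a ``reporting__`` prefix and attributes an ``attributes__`` prefix, which should be checked against the actual header of ``fnf_features.csv``.

```python
import pandas as pd

# Toy table; all column names except attributes__gender are hypothetical
toy = pd.DataFrame({
    'name': ['u1', 'u2'],
    'reporting__nb_records': [120, 80],
    'active_days__allweek__allday__callandtext': [30, 25],
    'number_of_contacts__allweek__allday__call': [12, 9],
    'attributes__gender': ['male', 'female'],
})

# Columns that are identifiers, reporting variables, or attributes
non_feature = [c for c in toy.columns
               if c == 'name' or c.startswith(('reporting__', 'attributes__'))]

# Keep only the behavioral indicators
features = toy.drop(columns=non_feature)

print(list(features.columns))
```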

In [None]:
from sklearn import svm, linear_model, ensemble, neighbors, tree
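from sklearn import metrics, preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20;
# model_selection provides the same train_test_split function
from sklearn import model_selection as cross_validation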

In [None]:
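# 3. We impute missing values in the features: each NaN is replaced by the mean of its column.
#    preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it.
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imp.fit_transform(df)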
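To make the effect of mean imputation concrete, here is a small sketch on a toy matrix. It assumes scikit-learn 0.20 or later, where the imputer is available as ``sklearn.impute.SimpleImputer``; the numbers are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value in each column
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

# Replace each NaN by the mean of the observed values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(toy)

print(filled)
```

The missing entry of the first column becomes 2.0 (the mean of 1 and 3) and the missing entry of the second column becomes 15.0 (the mean of 10 and 20).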

In [None]:
# 4. Standardize the features (center each column around 0 and scale it to unit variance)
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)
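Standardization subtracts the column mean and divides by the column standard deviation, so that features measured on very different scales contribute comparably. The following sketch checks this on a toy matrix with made-up values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: (x - mean) / std, column by column
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```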

### Classification with cross-validation

> Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.

> The advantages of support vector machines are:
> - Effective in high dimensional spaces.
> - Still effective in cases where number of dimensions is greater than the number of samples.
> - Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
> - Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

> Source: http://scikit-learn.org/stable/modules/svm.html


In [None]:
# 5. Split the records into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3)

# 6. Create an SVM classifier and train it on 70% of the data set
clf = svm.SVC()
clf.fit(X_train, y_train)

# 7. Analyze accuracy of predictions on 30% of the data set
clf.score(X_test, y_test)
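With so few users, the score of a single train/test split depends heavily on which users end up in the test set. One common remedy is k-fold cross-validation, which averages the score over several splits. The sketch below is one possible setup, not the only one: it runs on synthetic data from ``make_classification`` (not on the Friends and Family features), and it wraps imputation and scaling in a pipeline so that they are fitted on the training folds only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (X, y): 129 samples, 20 features
X_demo, y_demo = make_classification(n_samples=129, n_features=20, random_state=0)

# Knock out about 5% of the values to mimic missing indicators
rng = np.random.RandomState(0)
X_demo[rng.rand(*X_demo.shape) < 0.05] = np.nan

# Imputer and scaler are refitted inside each training fold
model = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler(), SVC())

# 5 stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough idea of how much a single split can vary.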

<div class="alert alert-info" role="alert">
    <strong>Question:</strong> Is it a good score? Why?
</div>

### Performance of the algorithm

The [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) helps visualize the performance of the algorithm. Each row corresponds to an actual class and each column to a predicted class. Because the labels are encoded as 0 (female) and 1 (male), the first row and column refer to female users and the second to male users.

In [None]:
y_pred = clf.predict(X_test)  # clf was already fitted on the training set above
cm = metrics.confusion_matrix(y_test, y_pred)

print(cm)
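To see how the entries of the matrix translate into metrics, here is a small worked example on hand-written labels (illustrative values only). For a binary problem, ``confusion_matrix`` orders the classes as 0 then 1, so ``ravel()`` returns the true negatives, false positives, false negatives, and true positives in that order.

```python
from sklearn import metrics

# Hand-written labels: 0 = female, 1 = male
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 1, 0, 1, 1, 0, 1]

cm_toy = metrics.confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()   # share of correct predictions
precision = tp / (tp + fp)            # predicted males that are actually male
recall = tp / (tp + fn)               # actual males that were found

print(cm_toy)
print(accuracy, precision, recall)
```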

In [None]:
len(X_train), len(X_test)

### Use other classifiers

You can easily use different classifiers in scikit-learn, such as:

- SVM with ``svm.SVC()`` (see above),
- k-nearest neighbors with ``neighbors.KNeighborsClassifier()``,
- random forests with ``ensemble.RandomForestClassifier()``

In [None]:
classifier = ensemble.RandomForestClassifier(random_state=0)

classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)                   
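Since all scikit-learn classifiers share the same ``fit``/``predict``/``score`` interface, several of them can be compared in a loop. The sketch below does so with 5-fold cross-validation on synthetic data from ``make_classification``; it illustrates the pattern only, and the resulting numbers say nothing about the Friends and Family data set.

```python
from sklearn import ensemble, neighbors, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=129, n_features=20, random_state=0)

candidates = {
    'SVM': svm.SVC(),
    'k-nearest neighbors': neighbors.KNeighborsClassifier(),
    'random forest': ensemble.RandomForestClassifier(random_state=0),
}

# Mean cross-validated accuracy for each classifier
results = {}
for name, estimator in candidates.items():
    model = make_pipeline(StandardScaler(), estimator)
    results[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```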