**Business Problem:** A game company wants to create new level-based customer definitions (personas) from some characteristics of its customers, group these personas into segments, and estimate how much revenue a new customer can be expected to bring in on average, given the segment they fall into.

**Rule-based classification**

**> Read the data and show general information about the dataset**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

pd.set_option('display.max_columns', None)

def load_persona():
    df = pd.read_csv("../input/persona-data/persona.csv")
    return df

df = load_persona()
df.head()

In [None]:
def check_df(dataframe, head=5, col_name="SEX", plot=False):
    """Print a quick overview of a DataFrame and optionally plot value counts of one column."""
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    # Quantiles are only defined for numeric columns, so text columns such as SOURCE or SEX are skipped.
    print(dataframe.select_dtypes(include="number").quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)
    if plot:
        # Bar chart of how often each category of col_name occurs.
        sns.countplot(x=col_name, data=dataframe)
        plt.show()

check_df(df, plot=True)


* How many unique SOURCE values are there? What are their frequencies?

In [None]:
print(df["SOURCE"].nunique())
df["SOURCE"].value_counts()
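
The difference between listing, counting, and tallying distinct values can be seen in a minimal sketch. The `source` Series below holds made-up values for illustration and is not taken from persona.csv.

```python
import pandas as pd

# Made-up values for illustration only; not taken from persona.csv.
source = pd.Series(["android", "ios", "android", "android"], name="SOURCE")

print(source.unique())        # the distinct values themselves
print(source.nunique())       # how many distinct values there are
print(source.value_counts())  # how often each distinct value occurs
```

`unique()` alone answers neither "how many" nor "how often", which is why the count and the frequency table are usually requested separately.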

* How many unique PRICE values are there?

In [None]:
df["PRICE"].nunique()

* How many sales were made at each PRICE?

In [None]:
df["PRICE"].value_counts()

* How many sales were made in each country?

In [None]:
df["COUNTRY"].value_counts()

* How much was earned in total from sales by country?

In [None]:
df.groupby('COUNTRY')['PRICE'].sum()

* How many sales were made through each SOURCE type?

In [None]:
df.groupby("SOURCE")["PRICE"].count()

* What are the PRICE averages by country?

In [None]:
df.groupby("COUNTRY")["PRICE"].mean()

* What are the PRICE averages by SOURCE?

In [None]:
df.groupby("SOURCE")["PRICE"].mean()

* What are the PRICE averages in the COUNTRY-SOURCE breakdown?

In [None]:
df.groupby(["COUNTRY","SOURCE"])["PRICE"].mean()

* What are the average earnings broken down by COUNTRY, SOURCE, SEX, and AGE?

In [None]:
df.groupby(["COUNTRY", "SOURCE", "SEX", "AGE"]).agg({"PRICE": "mean"})

* Sort the output by PRICE in descending order and save it as agg_df.

In [None]:
agg_df = df.groupby(["COUNTRY", "SOURCE", "SEX", "AGE"]).agg({"PRICE": "mean"}).sort_values("PRICE", ascending=False)
agg_df

* Convert the index levels (COUNTRY, SOURCE, SEX, AGE) into regular columns.

In [None]:
agg_df = agg_df.reset_index()
agg_df.head()

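* Convert the AGE variable to a categorical variable and add it to agg_df.

In [None]:
# Bin agg_df's own AGE column (not df's, whose rows do not line up with agg_df).
# Intervals are right-closed, so the edges 18, 23, 30, 40 match the labels 0_18, 19_23, 24_30, 31_40.
agg_df["AGE_CAT"] = pd.cut(agg_df["AGE"], bins=[0, 18, 23, 30, 40, 70], labels=["0_18", "19_23", "24_30", "31_40", "41_70"])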
agg_df.head()
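
How `pd.cut` assigns ages to labels is easiest to see on values that sit right on the bin edges. The sketch below uses made-up ages and assumes bin edges chosen so that each label matches the ages it covers.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([15, 18, 19, 23, 24, 30, 31, 40, 41, 66], name="AGE")

bins = [0, 18, 23, 30, 40, 70]
labels = ["0_18", "19_23", "24_30", "31_40", "41_70"]

# pd.cut uses right-closed intervals by default: (0, 18], (18, 23], (23, 30], ...
age_cat = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({"AGE": ages, "AGE_CAT": age_cat}))
```

Because the intervals include their right edge, an age of 18 falls into `0_18` and an age of 23 into `19_23`. Ages outside the outer edges (here above 70) would become NaN.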

* Identify new level-based customers.

In [None]:
col_names = ["COUNTRY", "SOURCE", "SEX", "AGE_CAT"]
# Join the four level columns into one upper-case persona key per row.
agg_df["customers_level_based"] = ["_".join(row).upper() for row in agg_df[col_names].values]
# Several ages fall into the same AGE_CAT, so a persona can appear more than once.
# Average PRICE per persona so that each persona occurs exactly once.
agg_df = agg_df.groupby("customers_level_based").agg({"PRICE": "mean"})
agg_df = agg_df.reset_index()
# Every count should now be 1.
agg_df["customers_level_based"].value_counts()
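
The two steps above, building the persona key and then collapsing duplicates, can be sketched on a tiny table. The `toy` frame and its prices are made up for illustration, and `apply` with `"_".join` is just one possible way to concatenate the columns.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from persona.csv.
toy = pd.DataFrame({
    "COUNTRY": ["usa", "usa", "tur"],
    "SOURCE": ["android", "android", "ios"],
    "SEX": ["male", "male", "female"],
    "AGE_CAT": ["0_18", "0_18", "31_40"],
    "PRICE": [30.0, 40.0, 50.0],
})

cols = ["COUNTRY", "SOURCE", "SEX", "AGE_CAT"]

# Cast to str first so that a categorical AGE_CAT column also joins cleanly.
toy["customers_level_based"] = (
    toy[cols].astype(str).apply("_".join, axis=1).str.upper()
)

# Rows that share a persona key are averaged so each key appears once.
personas = toy.groupby("customers_level_based", as_index=False)["PRICE"].mean()
print(personas)
```

The first two toy rows map to the same key, so they are merged into a single persona whose PRICE is their mean.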

* Segment the new customers (example: USA_ANDROID_MALE_0_18).
* Divide them into 4 segments according to PRICE.
* Add the segments to agg_df as a variable named SEGMENT.
* Describe the segments (group by segment and get the mean, max, and sum of PRICE).
* Analyze segment C (extract only segment C from the dataset and analyze it).

In [None]:
agg_df["SEGMENT"] = pd.qcut(agg_df["PRICE"], 4, labels=["D","C","B","A"])
agg_df

In [None]:
agg_df.groupby("SEGMENT").agg({"PRICE" : ["mean", "max", "sum"]})

In [None]:
agg_df[agg_df["SEGMENT"] == "C"].describe().T
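
To see why the labels are passed in the order D, C, B, A, here is a minimal sketch of quartile-based segmentation on eight made-up prices (not taken from the dataset).

```python
import pandas as pd

# Eight made-up average prices, one per hypothetical persona.
prices = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], name="PRICE")

# qcut orders its bins from lowest to highest values,
# so the first label ("D") goes to the cheapest quartile and "A" to the most expensive.
segments = pd.qcut(prices, 4, labels=["D", "C", "B", "A"]).rename("SEGMENT")

# Per-segment summary, analogous to describing the segments in agg_df.
summary = prices.groupby(segments, observed=True).agg(["mean", "max", "sum"])
print(summary)
```

Each segment holds roughly the same number of personas, since `qcut` splits on quantiles rather than on equal-width price ranges.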

* Classify new customers according to their segments and estimate how much income they can generate.

For example, which segment does a 33-year-old Turkish woman using Android belong to?

In [None]:
new_user = "TUR_ANDROID_FEMALE_31_40"
agg_df[agg_df["customers_level_based"] == new_user]

* Which segment does a 26-year-old American man using iOS belong to?

In [None]:
new_user2 = "USA_IOS_MALE_24_30"
agg_df[agg_df["customers_level_based"] == new_user2]
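
Writing the persona key by hand works for one or two customers. A small helper could build the key from raw attributes instead. The sketch below is only an illustration: `build_persona`, `lookup`, and the `toy_agg` table with its prices and segments are hypothetical, and the age bins are assumed to match the labels used for AGE_CAT.

```python
import pandas as pd

# Assumed bin edges and labels; they must match those used to create AGE_CAT.
AGE_BINS = [0, 18, 23, 30, 40, 70]
AGE_LABELS = ["0_18", "19_23", "24_30", "31_40", "41_70"]


def build_persona(country, source, sex, age):
    """Hypothetical helper: return a key such as 'TUR_ANDROID_FEMALE_31_40'."""
    # pd.cut on a one-element list returns a Categorical; [0] is its label.
    # An age outside the bins yields NaN, which would show up as 'NAN' in the key.
    age_cat = pd.cut([age], bins=AGE_BINS, labels=AGE_LABELS)[0]
    return "_".join([country, source, sex, str(age_cat)]).upper()


def lookup(persona_table, persona):
    """Hypothetical helper: return matching rows (empty if the persona was never observed)."""
    return persona_table[persona_table["customers_level_based"] == persona]


# Made-up stand-in for agg_df; the prices and segments are illustrative only.
toy_agg = pd.DataFrame({
    "customers_level_based": ["TUR_ANDROID_FEMALE_31_40", "USA_IOS_MALE_24_30"],
    "PRICE": [40.0, 35.0],
    "SEGMENT": ["A", "C"],
})

key = build_persona("tur", "android", "female", 33)
print(key)
print(lookup(toy_agg, key))
```

An empty result means the combination never occurred in the historical data, so the rule-based table cannot say anything about that customer.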