# Week 2 - Google Analytics and Python

## 1. Introduction to Python
Python was created by Guido van Rossum and first released in 1991 as a general-purpose, high-level programming language. It has become extremely popular over the past decade because of its intuitive syntax, flexibility, and versatility. According to a recent Developer Nation survey of 30,000 developers, Python was among the top three programming languages of 2023 and was rated the most popular language in data science, machine learning, and artificial intelligence.

I hope this class shows you the charm of Python and motivates you to keep learning this great programming language and its associated libraries for data science and machine learning.

### 1.1 Variables/Data structure
Let's try to create a variable x. The equals sign (=) means "assignment". A variable name can be anything as long as:
- it contains only letters, numbers, or underscores.
- the first character is not a number.
- the variable name is not one of the reserved keywords.

While a variable name can be almost anything, it is recommended that you use one of these styles:
- camel case (e.g., variableName)
- snake case (e.g., variable_name)
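
As a minimal sketch of the naming rules above, the snippet below creates the variable x along with a few validly named variables. The names and values are illustrative assumptions, not course data.

```python
import keyword

x = 5                # assignment: the name x now refers to the value 5
page_views = 120     # snake case
pageViews = 120      # camel case (a different variable from page_views)
_visits2024 = 300    # digits and underscores are fine after the first character

# Reserved keywords such as "for" cannot be used as variable names.
print(keyword.iskeyword("for"))         # True
print(keyword.iskeyword("page_views"))  # False

# str.isidentifier() checks the character rules (it does not check keywords).
print("2nd_visit".isidentifier())       # False: the first character is a number
```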



### 1.2 Data Type & Operators
The primary data types in Python are integers, floats, and strings. Values of these types can be further stored in containers such as lists, tuples, and dictionaries.

In [1]:
x=30 #integer
y=1.23 #float
usernames = "usernames=Grammarly" #string
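
To check which data type a variable holds, the built-in ```type()``` function can be used. The sketch below repeats the three variables from the cell above and adds one small example of each container; the container names and contents are assumptions chosen for illustration.

```python
x = 30                             # integer
y = 1.23                           # float
usernames = "usernames=Grammarly"  # string

print(type(x))          # <class 'int'>
print(type(y))          # <class 'float'>
print(type(usernames))  # <class 'str'>

# Containers that hold several values (illustrative contents).
dimension_names = ["city", "date"]          # list: ordered, can be changed
date_range = ("2024-01-01", "today")        # tuple: ordered, cannot be changed
row = {"city": "Dallas", "eventCount": 315} # dictionary: key-value pairs

print(type(dimension_names).__name__, type(date_range).__name__, type(row).__name__)
```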

Operators are symbols that perform operations on values, such as arithmetic.


In [2]:
A=2
B=3
C=A+B # the addition operation: +
print(C)

5


In [3]:
A=4
B=3
C=A-B
print(C) # the subtraction operator: -

1


In [4]:
A=3
B=2
C=A*B
print(C) # the multiplication operator: *

6


In [5]:
A=9
B=3
C=A/B
print(C) # the division operator: /

3.0
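
The division above prints 3.0 rather than 3 because the / operator always returns a float. As a short sketch, a few other arithmetic operators that are often useful are shown below; the values are arbitrary.

```python
A = 9
B = 2

print(A / B)   # true division always returns a float: 4.5
print(A // B)  # floor division drops the fractional part: 4
print(A % B)   # modulo gives the remainder: 1
print(A ** B)  # exponentiation: 81
```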


### 1.3 Lists and Dataframe
The way data is stored is called its structure.   
#### 1.3.1 List
Lists are collections of items.

In [8]:
#lists

dimensions=[] # an empty list.
dimensions=["city","date","source","medium"] # a list.

City=dimensions[0] #Python is a zero-based language.

print(City)
dimensions.append("country")

print(dimensions)
dimensions.remove("source")

print(dimensions)

city
['city', 'date', 'source', 'medium', 'country']
['city', 'date', 'medium', 'country']


#### 1.3.2 Dictionary
Dictionaries hold data that can be retrieved with reference items, or keys.

In [9]:

query_params={} # an empty dictionary
query_params = {'query': 'GenAI OR AI Marketing'
                ,'tweet.fields': 'author_id'
                ,'user.fields':'username'
                ,'start_time':'2024-02-22T10:14:49Z'
                }

query_params['end_time']='2024-02-23T10:14:49Z'

print(query_params)
print(query_params["query"])

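{'query': 'GenAI OR AI Marketing', 'tweet.fields': 'author_id', 'user.fields': 'username', 'start_time': '2024-02-22T10:14:49Z', 'end_time': '2024-02-23T10:14:49Z'}
GenAI OR AI Marketing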


In [None]:
query_params={} # an empty dictionary
query_params = {'query': {'keyword':'GenAI'}
                ,'tweet.fields': 'author_id'
                ,'user.fields':'username'
                ,'start_time':'2024-02-22T10:14:49Z'
                }

# How to access the keyword?
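
As a hint for nested dictionaries in general, the sketch below uses a separate, hypothetical dictionary named ```report_params``` (not part of the course code). Each level of nesting is reached with one more pair of square brackets.

```python
# Hypothetical nested dictionary, used only for illustration.
report_params = {
    "date_range": {"start": "2024-01-01", "end": "today"},
    "metrics": ["eventCount"],
}

# Chain one pair of square brackets per level of nesting.
start = report_params["date_range"]["start"]
print(start)  # 2024-01-01

# .get() returns a default value instead of raising KeyError for a missing key.
limit = report_params.get("limit", 100)
print(limit)  # 100
```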

#### 1.3.3 Dataframe
A Pandas DataFrame is a two-dimensional data structure, like a table with rows and columns. Notice that the code below starts with the command ```import pandas as pd```. This is the first time we import a package. Packages are an essential building block in programming: without them, we would spend a lot of time writing code that has already been written. Finding and using the right package is key to completing your task effectively.

Pandas is an open-source library built on top of Python specifically for data manipulation and analysis. It offers data structures and operations that are powerful, flexible, and easy to use. Pandas strengthens Python by giving it the capability to work with **spreadsheet-like data**, enabling fast loading, aligning, manipulating, and merging, in addition to other key functions.

In [1]:
import pandas as pd

edge=[{'source':'Max','target':'Miranda'}
    ,{'source':'Miranda','target':'Ben'}
    ,{'source':'Miranda','target':'Talha'}] 


#load data into a DataFrame object:
edge_df=pd.DataFrame(edge)

print(edge_df) 



    source   target
0      Max  Miranda
1  Miranda      Ben
2  Miranda    Talha


In [12]:
#index one column of a dataframe
edge_df["source"]

0        Max
1    Miranda
2    Miranda
Name: source, dtype: object

In [13]:
#index one row 
edge_df.iloc[0]

source        Max
target    Miranda
Name: 0, dtype: object

In [14]:
edge_df[edge_df["source"] =="Miranda"] 

    source  target
1  Miranda     Ben
2  Miranda   Talha
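
The last cell filters rows with a condition. As a minimal sketch of how that works, the comparison first produces a True/False value for every row, and the dataframe keeps only the rows marked True. The variable names ```mask``` and ```both``` are assumptions for this example.

```python
import pandas as pd

edge = [
    {"source": "Max", "target": "Miranda"},
    {"source": "Miranda", "target": "Ben"},
    {"source": "Miranda", "target": "Talha"},
]
edge_df = pd.DataFrame(edge)

# The comparison produces one True/False value per row.
mask = edge_df["source"] == "Miranda"
print(mask.tolist())  # [False, True, True]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
both = edge_df[(edge_df["source"] == "Miranda") & (edge_df["target"] == "Ben")]
print(both["target"].tolist())  # ['Ben']
```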


**Exercise**
Can you create a list of dictionaries in which each dictionary contains three key-value pairs? The first key is "source", the second key is "target", and the third key is "weight":

| source | target | weight |
| ------ | ------ | ------ |
| A      | B      | 20     |
| C      | B      | 14     |

### 1.4 Conditional statements & Control flow statement

As we write programs, we will need to carry out specific actions based on certain conditions (==, <, <=, >, >=, !=). Conditional statements are used to evaluate whether these certain conditions are being met.

The comparison operators can be combined with the logical operators and, or, and not.
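
A short sketch of combining comparisons with logical operators follows. The variable names and thresholds are assumptions chosen for illustration.

```python
# Illustrative values; names and thresholds are assumptions for this sketch.
event_count = 82
city = "Richardson"

# and: both conditions must be true
print(event_count > 50 and city == "Richardson")  # True
# or: at least one condition must be true
print(event_count > 100 or city == "Dallas")      # False
# not: reverses a condition
print(not city == "Dallas")                       # True

# elif checks a further condition when the earlier ones were false
if event_count >= 100:
    level = "high"
elif event_count >= 50:
    level = "medium"
else:
    level = "low"
print(level)  # medium
```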

In [15]:
#if statement

A=10
B=20
if A<B:
    print("A is less than B")
else:
    print("A is not less than B")

#for loops 

iterable=[1,2,3]
for x in iterable:
    print(x)

A is less than B
1
2
3


The ```iterrows()``` method generates an iterator object of the DataFrame, allowing us to iterate over each row in the DataFrame. Each iteration produces an index object and a row object (a Pandas Series object).

In [None]:
# for loop over each row of a dataframe
for index, row in edge_df.iterrows():
    print(index, row['source'], row['target'])
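
The loop above only prints each row. As a small self-contained sketch, the same pattern can also collect values into a list; the list name ```labels``` is an assumption for this example.

```python
import pandas as pd

edge_df = pd.DataFrame([
    {"source": "Max", "target": "Miranda"},
    {"source": "Miranda", "target": "Ben"},
    {"source": "Miranda", "target": "Talha"},
])

labels = []
for index, row in edge_df.iterrows():
    # row is a pandas Series; its values are looked up by column name
    labels.append("{} -> {}".format(row["source"], row["target"]))

print(labels)  # ['Max -> Miranda', 'Miranda -> Ben', 'Miranda -> Talha']
```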


**Exercise**: I have a dataframe named ```tweets```. It has one column ```username``` and another column ```mention```. Each value in the ```mention``` column is a list of usernames. Can you create a dataframe that represents the mentioning relationship between each ```username``` and each user being mentioned?

| username      | mention |
| ----------- | ----------- |
| A           | [B,C,D]       |
| C           | [A,D]        |


### 1.5 Functions
Functions are prewritten blocks of code that can be invoked to carry out a certain set of actions. For example, print() is a function. You can call functions in multiple ways. 
- The most intuitive way is to use the function name, followed by parentheses.
- Another way is to use "dot notation" by placing a period before the name of the function and after a specific object. For example, 
```
target_object.function_name()
```
- Sometimes we need to provide the function with certain variables or data values. They are called "parameters" or "arguments". They are passed to the function by putting them within a set of parentheses that follows the function name. For example, 
```python
print("Hello!")
```
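
The three ways of calling a function described above can be sketched with built-in functions and methods. The list contents are illustrative.

```python
dimensions = ["city", "date"]

# Function name followed by parentheses, with the argument inside
print(len(dimensions))     # 2

# Dot notation: the function belongs to the object before the dot
dimensions.append("source")
print(dimensions)          # ['city', 'date', 'source']
print("dallas".upper())    # DALLAS

# Several arguments are separated by commas
print(max(3, 7, 5))        # 7
```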

#### 1.5.1 Create Your Own Function
To tell Python that you would like to create a function, use the def keyword. After the def keyword, you provide the function name and any arguments your function will make use of. Then you can begin writing the commands.
```python
def name(parameters):
    # Code to carry out desired actions goes here.
    pass
```
Your functions will often require another keyword, the **return** keyword, to specify an expression, variable, or value you would like the function to pass back out to the main program once the function has finished running.
```python
def name(parameters):
    # Code to carry out desired actions goes here.
    return desiredExpression
``` 

If your function returns a value, you can store that value by calling the function on the right-hand side of an assignment.

```
returned_value=function_used(list of parameters)
```
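
Putting the pieces together, here is a minimal runnable sketch. The function ```add_numbers``` is a hypothetical helper written for this illustration.

```python
def add_numbers(a, b):
    """Return the sum of a and b (hypothetical helper for illustration)."""
    total = a + b
    return total

# Call the function and keep what it returns.
returned_value = add_numbers(2, 3)
print(returned_value)  # 5
```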


Below is a function ```create_url```. To produce its output, it uses the string method ```.format```, which fills placeholders in a text with values.

The following examples show how to use it. A placeholder can be specified by name, by number, or left empty.
```python
txt1 = "My name is {fname}, I'm {age}".format(fname = "John", age = 36)
txt2 = "My name is {0}, I'm {1}".format("John",36)
txt3 = "My name is {}, I'm {}".format("John",36)
```
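
As a side note, Python 3.6 and later also offer f-strings, which place the variable names directly inside the braces. The sketch below compares the two styles; the variable names ```txt_format``` and ```txt_fstring``` are assumptions for this example.

```python
fname = "John"
age = 36

txt_format = "My name is {}, I'm {}".format(fname, age)
txt_fstring = f"My name is {fname}, I'm {age}"

print(txt_fstring)                # My name is John, I'm 36
print(txt_format == txt_fstring)  # True
```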
With this knowledge, can you predict the output of the following snippet?

In [None]:
def create_url():
    usernames = "usernames=Grammarly"
    url = "https://api.twitter.com/2/users/by?{}".format(usernames)
    return url

my_url=create_url()
print(my_url)

## 2. Google Analytics Data Collection

Google Analytics provides APIs that let you retrieve data programmatically. Reports are defined flexibly by choosing **dimensions** (attributes of the data, such as city or date) and **metrics** (quantitative measurements, such as event count).
More information about dimensions and metrics can be found [here](https://support.google.com/analytics/answer/9143382?hl=en#zippy=%2Cattribution%2Cdemographics%2Cecommerce%2Cevent%2Cgaming%2Cgeneral%2Cgeography%2Clink%2Cpage-screen%2Cplatform-device%2Cpublisher%2Ctime%2Ctraffic-source%2Cuser%2Cuser-lifetime%2Cvideo%2Cadvertising%2Cpredictive%2Crevenue%2Csearch-console%2Csession).

In [14]:
#!pip3 install google.analytics.data

The main package we use is **google.analytics.data**.

In [2]:
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)
import os
import pandas as pd
import json

Below is a function ```sample_run_report```. Its parameter is the property ID; the dimensions, metrics, and date ranges are specified inside the request.

In [9]:
def sample_run_report(property_id="424145747"):
    """Runs a simple report on a Google Analytics 4 property."""
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'apt-port-251804-905e08b9e9e3.json'
    client = BetaAnalyticsDataClient()

    request = RunReportRequest(
        property=f"properties/{property_id}",
        dimensions=[Dimension(name="city"),Dimension(name="date")], #Dimension(name="browser"),
        metrics=[Metric(name="eventCount")],
        date_ranges=[DateRange(start_date="2024-01-01", end_date="today")],
    )
    response = client.run_report(request)
    return response

The response is a structured ```RunReportResponse``` object rather than a table, so we transform it into a dataframe using the following function **response_to_df**.

In [10]:
def response_to_df(response):
    columns = []
    rows = []
     
    for col in response.dimension_headers:
        columns.append(col.name)
    for col in response.metric_headers:
        columns.append(col.name)
     
    for row_data in response.rows:
        row = []
        for val in row_data.dimension_values:
            row.append(val.value)
        for val in row_data.metric_values:
            row.append(val.value)
        rows.append(row)
    return pd.DataFrame(rows, columns=columns)


In [11]:
response=sample_run_report(property_id="424145747")
df=response_to_df(response)

print(df)

               city      date eventCount
0            Dallas  20240305        315
1   University Park  20240311        185
2            Dallas  20240312        132
3            Dallas  20240302        104
4        Richardson  20240304         82
5            Dallas  20240221         68
6            Dallas  20240303         61
7            Dallas  20240222         45
8            Dallas  20240306         42
9            Dallas  20240311         42
10           Dallas  20240310         41
11           Dallas  20240304         40
12           Dallas  20240314         33
13           Dallas  20240220         26
14       Richardson  20240221         21
15           Dallas  20240309         19
16         McKinney  20240309         17
17       Richardson  20240124         15
18       Richardson  20240120         13
19        (not set)  20240203          9
20        (not set)  20240313          9
21           Dallas  20240313          8
22  University Park  20240312          8
23        (not s
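
Because ```response_to_df``` stores every value as text, the columns usually need converting before analysis. The sketch below retypes the first four rows of the output above by hand, so it runs without API credentials; the names ```df_sample``` and ```totals``` are assumptions for this example.

```python
import pandas as pd

# First four rows of the report output shown above, typed in by hand.
df_sample = pd.DataFrame({
    "city": ["Dallas", "University Park", "Dallas", "Richardson"],
    "date": ["20240305", "20240311", "20240312", "20240304"],
    "eventCount": ["315", "185", "132", "82"],
})

# The values arrive as strings, so convert them before doing arithmetic.
df_sample["eventCount"] = df_sample["eventCount"].astype(int)
df_sample["date"] = pd.to_datetime(df_sample["date"], format="%Y%m%d")

# Total events per city, largest first.
totals = df_sample.groupby("city")["eventCount"].sum().sort_values(ascending=False)
print(totals)
```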

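## 3. Two-sample t-test
A paired t-test compares the means of the same group at two different times or under two different conditions. The code below instead calls ```stats.ttest_ind```, which is an independent two-sample t-test: it compares the means of two separate groups, here the event counts for Richardson and for all cities other than Dallas. Note that the second group also contains the Richardson rows, so the two samples overlap.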

In [15]:
from scipy import stats
df['eventCount'] = df['eventCount'].astype(int)


group_a = df[df['city'] == 'Richardson']['eventCount']
print(group_a)
group_b = df[df['city'] != 'Dallas']['eventCount']
print(group_b)
# Perform the t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Output the results
print("T-statistic: {}, P-value: {}".format(t_stat,p_value))

4     82
14    21
17    15
18    13
25     7
27     6
37     4
Name: eventCount, dtype: int32
1     185
4      82
14     21
16     17
17     15
18     13
19      9
20      9
22      8
23      7
24      7
25      7
26      6
27      6
28      5
30      5
31      5
32      5
33      4
34      4
35      4
36      4
37      4
38      3
40      2
41      2
Name: eventCount, dtype: int32
T-statistic: 0.2789751962336633, P-value: 0.7821165616840785
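
For comparison, a paired t-test works on matched observations: each value in one sample is paired with a value in the other, and the test is run on the differences within each pair. The sketch below computes the paired t-statistic by hand with the standard library. The weekly counts are hypothetical numbers invented for illustration, not Google Analytics data. In SciPy, the corresponding function is ```stats.ttest_rel```, which should report the same statistic for these inputs.

```python
import math
import statistics

# Hypothetical event counts for the same four cities in two different weeks.
week_1 = [10, 12, 9, 11]
week_2 = [12, 14, 10, 13]

# A paired test works on the difference within each pair.
diffs = [b - a for a, b in zip(week_1, week_2)]
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences

# t = mean difference divided by its standard error
t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))

print(diffs)   # [2, 2, 1, 2]
print(t_stat)  # 7.0
```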
