#1. https://python-forum.io/Thread-lambda-or-operator

In [12]:
from operator import methodcaller

In [13]:
def ask_valid1(x, conversion=lambda choice: choice.lower()):
    return conversion(x)

In [14]:
def ask_valid2(x, conversion=methodcaller('lower')):
    return conversion(x)

In [15]:
def ask_valid3(x, conversion=str.lower):
    return conversion(x)

In [16]:
%timeit ask_valid1('Name')

335 ns ± 2.85 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [17]:
%timeit ask_valid2('Name')

353 ns ± 6.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [18]:
%timeit ask_valid3('Name')

312 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


I found some thoughts about this [here](https://treyhunner.com/2018/09/stop-writing-lambda-expressions/).

In my opinion, the third version (`str.lower`) looks the most Pythonic.

`methodcaller`
If you look at the `operator` module sources, you will find that `methodcaller` is a lightweight class. Calling it still runs some additional code, which can slightly increase execution time.

`str.lower` vs `lambda x: x.lower()`
The `lambda` keyword is redundant in this case; `str.lower` is shorter.

Let's look at the benchmarks above: `str.lower` is the fastest of the three and `methodcaller` the slowest, although the differences are only a few tens of nanoseconds per call and close to the measurement noise.
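Outside IPython, a similar comparison can be run with the standard `timeit` module. This is only a minimal sketch: the number of calls and repeats are arbitrary choices, and the absolute numbers will differ from machine to machine.

```python
import timeit
from operator import methodcaller


def ask_valid1(x, conversion=lambda choice: choice.lower()):
    return conversion(x)


def ask_valid2(x, conversion=methodcaller('lower')):
    return conversion(x)


def ask_valid3(x, conversion=str.lower):
    return conversion(x)


candidates = {'lambda': ask_valid1, 'methodcaller': ask_valid2, 'str.lower': ask_valid3}

# all three variants should give the same answer
results = {name: func('Name') for name, func in candidates.items()}

number = 100_000  # calls per measurement; an arbitrary choice
timings = {}
for name, func in candidates.items():
    # best of 3 runs, converted to nanoseconds per call
    best = min(timeit.repeat(lambda: func('Name'), number=number, repeat=3))
    timings[name] = best / number * 1e9

for name, ns in timings.items():
    print('{:<12} {:.0f} ns per call'.format(name, ns))
```

The extra `lambda` wrapper passed to `timeit.repeat` adds the same overhead to every variant, so the numbers are only meaningful relative to each other.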


Finally, if you have several methods where the `conversion` argument should have a default value (a function), I would suggest defining something like a `_default_conversion` function, e.g.:

In [19]:
def _default_conversion(x):
    # manage all methods 
    # that include conversion function in one place.
    # Might be also useful when debugging.
    return x.lower()

#OR 

_default_conversion = str.lower
    
class A:

    def ask_valid(self, *args, conversion=_default_conversion):
        pass

    def ask_invalid(self, *args, conversion=_default_conversion):
        pass

#2. https://python-forum.io/Thread-proper-syntax-for-itertuples

You don't need to use loops for this task at all. 

Something like this: 

df.loc[df.loc[:, 'Close'] > df.loc[:, 'prev'], 'trade2'] = '+'
df.loc[df.loc[:, 'Close'] < df.loc[:, 'prev'], 'trade2'] = '-'
df.loc[df.loc[:, 'Close'] == df.loc[:, 'prev'], 'trade2'] = df.loc[((df.loc[:, 'trade2'] == '+') | (df.loc[:, 'trade2'] == '-')).last_valid_index()]['trade2']

should work. 
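If the intent of the third assignment is to repeat the most recent `+` or `-` on rows where `Close` equals `prev` (this is an assumption about the desired behaviour), forward filling is one way to express it. A minimal sketch with made-up prices:

```python
import pandas as pd

# made-up data: rows 2 and 4 are ties (Close == prev)
df = pd.DataFrame({'Close': [10, 11, 11, 9, 9],
                   'prev':  [9, 10, 11, 11, 9]})

df.loc[df['Close'] > df['prev'], 'trade2'] = '+'
df.loc[df['Close'] < df['prev'], 'trade2'] = '-'
# ties are still NaN at this point; carry the previous sign forward
df['trade2'] = df['trade2'].ffill()

print(df)
```

A tie in the very first row has nothing to carry forward and stays `NaN`.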



#3. https://python-forum.io/Thread-Simple-String-to-Time-within-a-pandas-dataframe

In [1]:
import pandas as pd

In [3]:
df = pd.DataFrame({'atime': ['13-06-2019 10:00', '12-06-2019 09:15'], 'x': [1,2]})

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
atime    2 non-null object
x        2 non-null int64
dtypes: int64(1), object(1)
memory usage: 112.0+ bytes


In [9]:
df.atime = pd.to_datetime(df.atime)

In [10]:
df

Unnamed: 0,atime,x
0,2019-06-13 10:00:00,1
1,2019-12-06 09:15:00,2


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
atime    2 non-null datetime64[ns]
x        2 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 112.0 bytes
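Note that in the output above the two rows were parsed with different conventions: `13-06-2019` became 13 June, but `12-06-2019` became 6 December. If the strings are day-first (an assumption about the data), passing an explicit `format` avoids the ambiguity. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'atime': ['13-06-2019 10:00', '12-06-2019 09:15'], 'x': [1, 2]})

# explicit day-month-year format, so no guessing is involved
df['atime'] = pd.to_datetime(df['atime'], format='%d-%m-%Y %H:%M')

print(df)
```

`pd.to_datetime(df.atime, dayfirst=True)` is a shorter alternative when the exact format is not known in advance.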


#4. https://python-forum.io/Thread-How-to-vstack-matrixs-in-numpy-with-different-numbers-of-columns

In [22]:
import numpy as np
import pandas as pd
a = np.array([["A1", "B1", "C1"], ["A1", "B1", "C1"]])
b = np.array([["A2", "B2"], ["A2", "B2"]])
c = np.array([["A3", "B3"], ["A3", "B3"]])
a = pd.DataFrame(a, columns=None)
b = pd.DataFrame(b, columns=None)
pd.concat([a,b]).fillna('').values

array([['A1', 'B1', 'C1'],
       ['A1', 'B1', 'C1'],
       ['A2', 'B2', ''],
       ['A2', 'B2', '']], dtype=object)

In [23]:
import pandas as pd

In [24]:
a = pd.DataFrame(a, columns=None)
b = pd.DataFrame(b, columns=None)

In [25]:
b

Unnamed: 0,0,1
0,A2,B2
1,A2,B2


In [26]:
pd.concat([a,b]).fillna('').values

array([['A1', 'B1', 'C1'],
       ['A1', 'B1', 'C1'],
       ['A2', 'B2', ''],
       ['A2', 'B2', '']], dtype=object)

#5. https://python-forum.io/Thread-Converting-str-to-binary

In [3]:
mystring = 'My beautiful house'

def word2bin(w):
    # concatenate the 8-bit binary codes of the characters of w
    return '0b' + ''.join('{0:08b}'.format(ord(x)) for x in w)

bin(sum(map(lambda x: int(x, 2), (word2bin(x) for x in mystring.split()))))

'0b11000100110010101100001011101011101110011011000110111000011011001001010'

#6. https://python-forum.io/Thread-How-to-re-arrange-DataFrame-columns

In [6]:
import pandas as pd

In [7]:
micolumns = pd.MultiIndex.from_tuples([('X', 'foo', '10'), ('X', 'bar', '10'),
                                       ('Y', 'foo', '10'), ('Y', 'bar', '10')],
                                      names=['l0', 'l1', 'l2'])

In [8]:
micolumns

MultiIndex(levels=[['X', 'Y'], ['bar', 'foo'], ['10']],
           labels=[[0, 0, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]],
           names=['l0', 'l1', 'l2'])

In [30]:
import numpy as np  # pd.np is not available in recent pandas versions
arr = pd.DataFrame(np.arange(12).reshape(3, 4), columns=micolumns)

In [16]:
arr

l0,X,X,Y,Y
l1,foo,bar,foo,bar
l2,10,10,10,10
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [17]:
arr.reset_index()

l0,index,X,X,Y,Y
l1,Unnamed: 1_level_1,foo,bar,foo,bar
l2,Unnamed: 1_level_2,10,10,10,10
0,0,0,1,2,3
1,1,4,5,6,7
2,2,8,9,10,11


In [29]:
arr.T.reset_index()

Unnamed: 0,l0,l1,l2,0,1,2
0,X,foo,10,0,4,8
1,X,bar,10,1,5,9
2,Y,foo,10,2,6,10
3,Y,bar,10,3,7,11


#7. https://python-forum.io/Thread-Help-understanding-Bioinformatics-question

You need to divide your problem into a set of smaller ones. The underlying math isn't hard: the Jaccard coefficient (similarity) is the ratio of the measure of the intersection of two sets to the measure of their union. The Jaccard distance is 1 - Jaccard coefficient.
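For plain sets, where the measure is simply the number of elements, the definition can be sketched as follows. The function names are hypothetical, and this unweighted version is simpler than the count-based measure used for the fragment files.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


s1 = {'AACCTTGG', 'CCTTGGA'}
s2 = {'CCTTGGA', 'ACC', 'AGT'}
print(jaccard_similarity(s1, s2))  # 1 common element out of 4 -> 0.25
print(jaccard_distance(s1, s2))    # 0.75
```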

So, you need to implement a function that traverses the specified directory and returns 
data loaded from two files. This function could be implemented as a generator 
that yields a new pair of data sets until all possible combinations
have been traversed (s(s-1)/2 pairs, where s is the number of files in the folder).  
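The pairing step described above could be sketched with `glob` and `itertools.combinations`. The function name, the `*.csv` pattern and the header-less two-column CSV layout are assumptions made for illustration, not details from the question.

```python
import glob
import itertools
import os

import pandas as pd


def iter_file_pairs(path='.', pattern='*.csv'):
    """Yield (df1, df2, (name1, name2)) for every unordered pair of files."""
    files = sorted(glob.glob(os.path.join(path, pattern)))
    # combinations(files, 2) produces each of the s(s-1)/2 pairs exactly once
    for f1, f2 in itertools.combinations(files, 2):
        df1 = pd.read_csv(f1, header=None)  # columns are labelled 0 and 1
        df2 = pd.read_csv(f2, header=None)
        yield df1, df2, (os.path.basename(f1), os.path.basename(f2))
```

Each file is re-read for every pair it takes part in; for large folders it may be worth loading every file once and pairing the loaded frames instead.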


Below is a sketch of the solution; it is completely untested but might be helpful...


In [None]:
def traverse_dir(path='.'):
    # some code goes here, probably you'll need to use os.walk
    yield df1, df2, filenames  # df1, df2 assumed to be pandas dataframes; each dataframe has two columns
    
   
    
def get_jaccard(df1, df2, n=3):
    """Return Jaccard distance between two dfs of specified form
    
    Parameters
    ==========
        
        :param df1: Pandas data frame
        :param df2: Pandas data frame
        :param n: an integer, the number of most frequent ... to use
    
    
    Notes
    =====
    
    df1, df2 assumed to have the following form:
    
    df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1:[4, 8]})
    df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC'], 1:[3, 4, 8]})
    
    Pandas assumed to be imported as pd.
    """
    
    # n most frequent features of each frame, as counts indexed by feature name
    c1 = df1.sort_values(by=1, ascending=False)[:n].set_index(0)[1]
    c2 = df2.sort_values(by=1, ascending=False)[:n].set_index(0)[1]
    common = c1.index.intersection(c2.index)
    # sum over the common features of min(count in df1, count in df2)
    comm_measure = pd.concat([c1.loc[common], c2.loc[common]], axis=1).min(axis=1).sum()
    all_unique = ... # write something here... 
    
    return  (1 - comm_measure / all_unique)


requested_path = input('Enter path:')
n = int(input('Enter n:'))  # input() returns a string, but n is used for slicing

# and something like this... 
for a, b, filenames in traverse_dir(requested_path):
    print("Processing files: {}, result={} ".format(filenames, get_jaccard(a, b, n=n)))


In [215]:
import pandas as pd
df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1:[4, 8]})
df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC', 'AGT'], 1:[3, 4, 8, 8]})

In [265]:
def get_jaccard(df1, df2, n=3):
    """Return Jaccard distance between two dfs of specified form
    
    Parameters
    ==========
        
        :param df1: Pandas data frame
        :param df2: Pandas data frame
        :param n: an integer, the number of most frequent ... to use
    
    
    Notes
    =====
    
    df1, df2 assumed to have the following form:
    
    df1 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA'], 1:[4, 8]})
    df2 = pd.DataFrame({0: ['AACCTTGG', 'CCTTGGA', 'ACC', 'AGT'], 1:[3, 4, 8, 8]})
    
    Expected value for df1 and df2:
        get_jaccard(df1, df2, n=3) should return:
        1 - 4 / (8 + 8 + 8 + 4)
    
    Explanation
    -----------
       df2 3 most frequent features are ['CCTTGGA', 'ACC', 'AGT']
       df1 3 most frequent features are ['AACCTTGG', 'CCTTGGA']
       common features: [CCTTGGA]
       
       Jaccard = measure(intersection)/measure(union)
       
       Let "measure = the number of fragments"
       Ok, 'CCTTGGA' count in df2 = 4, 
       'CCTTGGA' count in df1 = 8,
       
       Measure of intersection: min(4, 8) = 4
       If we had several common fragments, we would
       compute their sum, e.g. min(a1, b1) + min(a2, b2) etc.
       Here we have only one: 'CCTTGGA';
       
       Measure of union: count of only df1 features + count of only df2 features +
                      max(counts of common_features)
       
       # Note: we consider only 3 most frequent features!
       count of only df1 features: 4
       count of only df2 features: 8 (ACC) + 8 (AGT)
       max(counts of common_features): max(4, 8)
       
       So, we got:
         Jaccard similarity = 4 / (8 + 8 + 8 + 4)
         
         and, finally:
         
         Jaccard dist. = 1 - Jaccard similarity

    Pandas assumed to be imported as pd.
    """
    
    # n most frequent features of each frame, as counts indexed by feature name
    # (feature names are assumed to be unique within a frame)
    c1 = df1.sort_values(by=1, ascending=False)[:n].set_index(0)[1]
    c2 = df2.sort_values(by=1, ascending=False)[:n].set_index(0)[1]
    common = c1.index.intersection(c2.index)
    # counts of the common features, aligned by feature name
    both = pd.concat([c1.loc[common], c2.loc[common]], axis=1)
    comm_measure = both.min(axis=1).sum()      # measure of the intersection
    comm_measure_max = both.max(axis=1).sum()  # common part of the union
    d1_only_measure = c1.drop(common).sum()
    d2_only_measure = c2.drop(common).sum()
    total = d1_only_measure + d2_only_measure + comm_measure_max
    return 1 - comm_measure / total

In [266]:
get_jaccard(df1, df2)

0.8571428571428572

#8. https://python-forum.io/Thread-pandas-convert-Int-to-str

In [31]:
import pandas as pd

In [32]:
df = pd.DataFrame({'x': [1,2,3,4,5]})

In [33]:
df.x.str

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

In [30]:
df = pd.DataFrame({'x' : [1,2,3,4]})
df.x = df.x.astype(str)
df.x.str #  No errors!

<pandas.core.strings.StringMethods at 0x7f6f716ee160>

In [1]:
import pandas as pd
df = pd.DataFrame({'x': ['02', '34', '02'], 'y':[1,2,3]})
df[df.x.str.startswith('02')].y.sum()

In [5]:
df = pd.DataFrame({'x': [02, 34, 02], 'y':[1, 2, 3]})

SyntaxError: invalid token (<ipython-input-5-a49b2b277d48>, line 1)
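The `SyntaxError` above arises because Python 3 does not allow integer literals with leading zeros, so codes such as `02` have to be stored as strings. If the column already holds integers, one option is to convert and zero-pad it. A minimal sketch, assuming two-digit codes:

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 34, 2], 'y': [1, 2, 3]})

# int -> str, then left-pad with zeros to width 2 (the width is an assumption)
df['x'] = df['x'].astype(str).str.zfill(2)

total = df.loc[df['x'].str.startswith('02'), 'y'].sum()
print(df['x'].tolist(), total)
```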

#9. https://python-forum.io/Thread-How-to-use-the-excel-filename-as-a-value-to-populate-new-column-using-Pandas

In [None]:
import pandas as pd
df = pd.DataFrame({'x': range(1, 2)})
df['y'] = 'filename'

If your data frame is already defined and filled, you can create a new column and fill it as follows: `df['Date'] = 'desired value'`. This automatically creates the `Date` column in the data frame and fills it with the desired value.

If you want to incorporate data from all Excel files into one data frame, your algorithm might be the following:

1. Define an empty list, e.g. named `acc` (it will be used later).
2. Use `os.walk` or `glob.glob` to iterate over all files.
3. Load data from each file using pandas, e.g. `pandas.read_csv` or `pandas.read_excel`.
4. While iterating over the files, you know their names. If `filename` is the name of the file currently being loaded into `df`, you can just do `df['Dates'] = filename`.
5. Append each `df` to the `acc` list.
6. Use `pd.concat` to combine all dfs stored in `acc` into a new data frame.
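The steps above could look roughly like this. The function name, the `*.xlsx` pattern and the `reader` parameter are hypothetical choices for illustration; `pd.read_excel` also needs an Excel engine such as `openpyxl` to be installed.

```python
import glob
import os

import pandas as pd


def combine_files(folder, pattern='*.xlsx', reader=pd.read_excel):
    """Load every matching file and tag its rows with the file name."""
    acc = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        df = reader(path)
        df['Dates'] = os.path.basename(path)  # same value for all rows of this file
        acc.append(df)
    # pd.concat raises ValueError when acc is empty, i.e. when no file matched
    return pd.concat(acc, ignore_index=True)
```

For CSV files the same function can be called as `combine_files(folder, pattern='*.csv', reader=pd.read_csv)`.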


#10. https://python-forum.io/Thread-to-numpy-works-in-jupyter-notebook-but-not-in-python-script

In [1]:
import pandas as pd
import numpy as np
 
# SWAPPING COLUMNS
 
dates=pd.date_range('1/1/2019',periods=12)
df=pd.DataFrame(np.random.randn(12,4),index=dates,columns=['A','B','C','D'])
 
df_copy=df.copy()
# assign, after converting to raw data
df_copy.loc[:, ['B', 'A']]=df_copy[['A','B']].to_numpy()
print(df_copy)

AttributeError: 'DataFrame' object has no attribute 'to_numpy'

In [5]:
df.values

array([[-0.61417772,  0.06380079,  1.15613779, -0.28688711],
       [ 0.01909134, -0.66876421, -1.01488265, -0.21196262],
       [ 0.57949582,  1.25926954,  0.40566515, -0.41108333],
       [ 1.44020932, -0.56669488,  0.27853978,  1.37395866],
       [-0.41777311,  3.33644774, -1.72722513, -1.07309577],
       [ 0.03072395, -0.73953554,  1.82514596,  0.33596415],
       [-0.96578259, -0.4898714 , -0.14279001, -0.81185085],
       [-0.55274676,  0.87616439, -1.44815004, -0.08169525],
       [ 1.25463458,  1.12969333,  2.10538213,  1.89016403],
       [-0.0045278 ,  0.59043533, -0.44581276, -0.72068009],
       [-0.32812966, -0.21815762,  1.1946445 ,  0.56836199],
       [ 0.7632214 , -1.59373288,  0.96649909,  0.53527624]])

In [7]:
pd.__version__

'0.23.4'

As the official pandas docs show, `.to_numpy` was added in version 0.24.0. On older versions, such as the 0.23.4 installed here, use `.values` instead.
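If the same script has to run on both older and newer pandas versions, one option is to fall back to `.values` when `.to_numpy` is missing. A minimal sketch with a hypothetical helper `as_array` and small deterministic data instead of random numbers:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2019', periods=4)
df = pd.DataFrame(np.arange(16.0).reshape(4, 4), index=dates, columns=['A', 'B', 'C', 'D'])


def as_array(frame):
    """Return the underlying NumPy array on old and new pandas versions."""
    if hasattr(frame, 'to_numpy'):  # pandas >= 0.24.0
        return frame.to_numpy()
    return frame.values             # fallback for older versions


df_copy = df.copy()
# assigning raw data avoids column alignment, so A and B are swapped
df_copy.loc[:, ['B', 'A']] = as_array(df_copy[['A', 'B']])
print(df_copy)
```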

#11. https://python-forum.io/Thread-nltk-Relations-Extractor

In [139]:
import nltk
import re 
from nltk.chunk import ne_chunk_sents, ne_chunk
from nltk.sem import relextract
sent = "China is really in Asia and not presented. Russia is in Asia. What is really happend here? Japan in Asia"
sent = nltk.pos_tag(sent.split())

In [141]:
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

In [150]:
nltk.sem.extract_rels('GPE','GPE',ne_chunk(sent),pattern=IN, window=5)

[defaultdict(str,
             {'filler': 'is/VBZ really/RB in/IN',
              'lcon': '',
              'objclass': 'GPE',
              'objsym': 'asia',
              'objtext': 'Asia/NNP',
              'rcon': 'and/CC not/RB presented./JJ',
              'subjclass': 'GPE',
              'subjsym': 'china',
              'subjtext': 'China/NNP',
              'untagged_filler': 'is really in'})]

In [162]:
for rel in  nltk.sem.relextract.extract_rels('GPE','GPE',list(ne_chunk_sents(sent)),corpus='ace',pattern=IN, window=5):
   print(nltk.sem.relextract.rtuple(rel))

#12. https://python-forum.io/Thread-floating-point-addition#message

In [20]:
import sys

In [None]:
In general, `sys.float_info[-3]` (also available as `sys.float_info.epsilon`) is the `epsilon` value of the floating-point representation. That means the next floating-point
number greater than 1.0 that can be represented exactly at the given precision (e.g. double) is 1.0 + epsilon. Roughly, for a given number `x`, 
the next floating-point number is about `x+x*epsilon`.

> Skaperen wrote: i hope that == for comparing float to float always compares for exact equal.

Exact comparison is not always what you want: e.g. `1.1 / 3.3 == 1.0 / 3.0` returns `False` because of rounding errors, which is why the numpy package has the utility function `numpy.isclose` 
for comparing floating-point numbers.
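A short sketch of both points: the meaning of `epsilon` and a tolerance-based comparison. `math.isclose` from the standard library is shown next to `numpy.isclose`; both use their default tolerances here.

```python
import math
import sys

import numpy as np

eps = sys.float_info.epsilon  # same value as sys.float_info[-3]

# 1.0 + eps is the next representable double after 1.0;
# adding only half of eps is rounded back to 1.0
next_after_one = 1.0 + eps
half_step = 1.0 + eps / 2

a = 1.1 / 3.3
b = 1.0 / 3.0
exact_equal = (a == b)             # exact, bit-for-bit comparison
close_np = bool(np.isclose(a, b))  # comparison with a tolerance
close_math = math.isclose(a, b)

print(exact_equal, close_np, close_math)
```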

#13. https://python-forum.io/Thread-Django-How-to-automatically-substitute-a-variable-in-the-admin-page-at-Django-1-11

In [None]:
from django.db import models

class City(models.Model):
    name = models.CharField(max_length=10, default='', blank=False)
    area = models.FloatField(default=0.0, blank=True)
    
    def __str__(self):
        return self.name

class Person(models.Model):
    first_name = models.CharField(max_length=10, default='', blank=False)
    where_from = models.ForeignKey(City, null=True, blank=True, on_delete=models.CASCADE)
  
    def __str__(self):
        return self.first_name 

# define other models here if needed   
    
    
#admin.py
#register all models as usual



# views.py
# Include a Person instance (named `instance`) in the current context and pass
# this context to the template renderer. You can then get the city name
# in your template as follows:
# ---- some template code
# My name is {{ instance.first_name }}. I am from {{ instance.where_from }}. I was born
# in {{ instance.where_from.name }}. (This is an alternative way to get the city's name;
# we don't need it once a __str__ method is defined in the City model.)